Connecting the dots between audio and text without parallel data through visual knowledge transfer

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org

We introduce InternVideo2, a new video foundation model (ViFM) that achieves the state-of-
the-art performance in action recognition, video-text tasks, and video-centric dialogue. Our …

Cited by 33 - Related articles - All 3 versions

[PDF] thecvf.com

Fusing pre-trained language models with multimodal prompts through reinforcement learning

Y Yu, J Chung, H Yun, J Hessel… - Proceedings of the …, 2023 - openaccess.thecvf.com

Language models are capable of commonsense reasoning: while domain-specific
models can learn from explicit knowledge (e.g. commonsense graphs [6], ethical norms [25]) …

Cited by 15 - Related articles - All 4 versions

[PDF] arxiv.org

Contrastive audio-language learning for music

I Manco, E Benetos, E Quinton, G Fazekas - arXiv preprint arXiv …, 2022 - arxiv.org

As one of the most intuitive interfaces known to humans, natural language has the potential
to mediate many tasks that involve human-computer interaction, especially in application …

Cited by 41 - Related articles - All 8 versions

[PDF] arxiv.org

Prefix tuning for automated audio captioning

M Kim, K Sung-Bin, TH Oh - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org

Audio captioning aims to generate text descriptions from environmental sounds. One
challenge of audio captioning is the difficulty of the generalization due to the lack of audio …

Cited by 37 - Related articles - All 3 versions

[PDF] arxiv.org

Multimodal knowledge alignment with reinforcement learning

Y Yu, J Chung, H Yun, J Hessel, JS Park, X Lu… - arXiv preprint arXiv …, 2022 - arxiv.org

Large language models readily adapt to novel settings, even without task-specific training
data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we …

Cited by 27 - Related articles - All 3 versions

[PDF] dcase.community

[PDF] The SJTU system for DCASE2022 challenge task 6: Audio captioning with audio-text retrieval pre-training

X Xu, Z Xie, M Wu, K Yu - Tech. Rep., DCASE2022 Challenge, 2022 - dcase.community

This technical report describes the system submitted to the Detection and Classification of
Acoustic Scenes and Events (DCASE) 2022 challenge Task 6. There are two involving …

Cited by 34 - Related articles - All 2 versions

[PDF] thecvf.com

Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

H Yun, J Na, G Kim - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com

Sound can convey significant information for spatial reasoning in our daily lives. To endow
deep networks with such ability, we address the challenge of dense indoor prediction with …

Cited by 5 - Related articles - All 6 versions

[PDF] arxiv.org

CAT: Causal audio transformer for audio classification

X Liu, H Lu, J Yuan, X Li - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org

The attention-based Transformers have been increasingly applied to audio classification
because of their global receptive field and ability to handle long-term dependency. However …

Cited by 19 - Related articles - All 3 versions

[PDF] arxiv.org

BLAT: Bootstrapping language-audio pre-training based on AudioSet tag-guided synthetic data

X Xu, Z Zhang, Z Zhou, P Zhang, Z Xie, M Wu… - Proceedings of the 31st …, 2023 - dl.acm.org

Compared with ample visual-text pre-training research, few works explore audio-text pre-
training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods …

Cited by 10 - Related articles - All 4 versions

[PDF] neurips.cc

Towards effective multi-modal interchanges in zero-resource sounding object localization

Y Zhao, C Zhang, H Huang, H Li… - Advances in Neural …, 2022 - proceedings.neurips.cc

Aiming to locate the object that emits a specified sound in complex scenes, the task of
sounding object localization bridges two perception-oriented modalities of vision and …

Cited by 5 - Related articles - All 4 versions
