InternVideo2: Scaling video foundation models for multimodal video understanding

Y Wang, K Li, X Li, J Yu, Y He, G Chen, B Pei… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce InternVideo2, a new video foundation model (ViFM) that achieves state-of-the-art
performance in action recognition, video-text tasks, and video-centric dialogue. Our …

Fusing pre-trained language models with multimodal prompts through reinforcement learning

Y Yu, J Chung, H Yun, J Hessel… - Proceedings of the …, 2023 - openaccess.thecvf.com
Language models are capable of commonsense reasoning: while domain-specific
models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]) …

Contrastive audio-language learning for music

I Manco, E Benetos, E Quinton, G Fazekas - arXiv preprint arXiv …, 2022 - arxiv.org
As one of the most intuitive interfaces known to humans, natural language has the potential
to mediate many tasks that involve human-computer interaction, especially in application …

Prefix tuning for automated audio captioning

M Kim, K Sung-Bin, TH Oh - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Audio captioning aims to generate text descriptions from environmental sounds. One
challenge of audio captioning is the difficulty of generalization due to the lack of audio …

Multimodal knowledge alignment with reinforcement learning

Y Yu, J Chung, H Yun, J Hessel, JS Park, X Lu… - arXiv preprint arXiv …, 2022 - arxiv.org
Large language models readily adapt to novel settings, even without task-specific training
data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we …

The SJTU system for DCASE2022 challenge task 6: Audio captioning with audio-text retrieval pre-training

X Xu, Z Xie, M Wu, K Yu - Tech. Rep., DCASE2022 Challenge, 2022 - dcase.community
This technical report describes the system submitted to the Detection and Classification of
Acoustic Scenes and Events (DCASE) 2022 Challenge Task 6. There are two involving …

Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation

H Yun, J Na, G Kim - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Sound can convey significant information for spatial reasoning in our daily lives. To endow
deep networks with such ability, we address the challenge of dense indoor prediction with …

CAT: Causal audio transformer for audio classification

X Liu, H Lu, J Yuan, X Li - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Attention-based Transformers have been increasingly applied to audio classification
because of their global receptive field and ability to handle long-term dependencies. However …

BLAT: Bootstrapping language-audio pre-training based on AudioSet tag-guided synthetic data

X Xu, Z Zhang, Z Zhou, P Zhang, Z Xie, M Wu… - Proceedings of the 31st …, 2023 - dl.acm.org
Compared with ample visual-text pre-training research, few works explore audio-text pre-
training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods …

Towards effective multi-modal interchanges in zero-resource sounding object localization

Y Zhao, C Zhang, H Huang, H Li… - Advances in Neural …, 2022 - proceedings.neurips.cc
Aiming to locate the object that emits a specified sound in complex scenes, the task of
sounding object localization bridges two perception-oriented modalities of vision and …