Language models are capable of commonsense reasoning: while domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]) …
As one of the most intuitive interfaces known to humans, natural language has the potential to mediate many tasks that involve human-computer interaction, especially in application …
Audio captioning aims to generate text descriptions from environmental sounds. One challenge of audio captioning is the difficulty of generalization due to the lack of audio …
Large language models readily adapt to novel settings, even without task-specific training data. Can their zero-shot capacity be extended to multimodal inputs? In this work, we …
X Xu, Z Xie, M Wu, K Yu - Tech. Rep., DCASE2022 Challenge, 2022 - dcase.community
This technical report describes the system submitted to the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 challenge Task 6. There are two involving …
H Yun, J Na, G Kim - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with …
X Liu, H Lu, J Yuan, X Li - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Attention-based Transformers have been increasingly applied to audio classification because of their global receptive field and ability to handle long-term dependencies. However …
Compared with ample visual-text pre-training research, few works explore audio-text pre-training, mostly due to the lack of sufficient parallel audio-text data. Most existing methods …
Aiming to locate the object that emits a specified sound in complex scenes, the task of sounding object localization bridges two perception-oriented modalities of vision and …