A Nadeem, A Hilton, R Dawes… - Proceedings of the …, 2024 - openaccess.thecvf.com
In the context of Audio Visual Question Answering (AVQA) tasks, the audio and visual
modalities could be learnt on three levels: 1) Spatial, 2) Temporal, and 3) Semantic. Existing …