Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Light field salient object detection: A review and benchmark

K Fu, Y Jiang, GP Ji, T Zhou, Q Zhao… - Computational Visual …, 2022 - Springer
Salient object detection (SOD) is a long-standing research topic in computer vision with
increasing interest in the past decade. Since light fields record comprehensive information of …

Everything at once – multi-modal fusion transformer for video retrieval

N Shvetsova, B Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks …

Hierarchical multimodal transformer to summarize videos

B Zhao, M Gong, X Li - Neurocomputing, 2022 - Elsevier
Although video summarization has achieved tremendous success benefiting from Recurrent
Neural Networks (RNN), RNN-based methods neglect the global dependencies and multi …

ViNet: Pushing the limits of visual modality for audio-visual saliency prediction

S Jain, P Yarlagadda, S Jyoti, S Karthik… - 2021 IEEE/RSJ …, 2021 - ieeexplore.ieee.org
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully
convolutional encoder-decoder architecture. The encoder uses visual features from a …

CASP-Net: Rethinking video saliency prediction from an audio-visual consistency perceptual perspective

J Xiong, G Wang, P Zhang, W Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the
selective attention mechanism of the human brain. By focusing on the benefits of joint auditory …

In the eye of transformer: Global–local correlation for egocentric gaze estimation and beyond

B Lai, M Liu, F Ryan, JM Rehg - International Journal of Computer Vision, 2024 - Springer
Predicting a human's gaze from egocentric videos plays a critical role in understanding
human intention in daily activities. In this paper, we present the first transformer-based model …

From semantic categories to fixations: A novel weakly-supervised visual-auditory saliency detection approach

G Wang, C Chen, DP Fan, A Hao… - Proceedings of the …, 2021 - openaccess.thecvf.com
Thanks to the rapid advances in the deep learning techniques and the wide availability of
large-scale training sets, the performances of video saliency detection models have been …

Repetitive activity counting by sight and sound

Y Zhang, L Shao, CGM Snoek - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
This paper strives for repetitive activity counting in videos. Different from existing works,
which all analyze the visual video content only, we incorporate for the first time the …

Beyond image to depth: Improving depth prediction using echoes

KK Parida, S Srivastava… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We address the problem of estimating depth with multi-modal audio-visual data. Inspired by
the ability of animals, such as bats and dolphins, to infer distance of objects with …