Learning in audio-visual context: A review, analysis, and new perspective

Y Wei, D Hu, Y Tian, X Li - arXiv preprint arXiv:2208.09579, 2022 - arxiv.org
Sight and hearing are two senses that play a vital role in human communication and scene
understanding. To mimic human perception ability, audio-visual learning, aimed at …

Light field salient object detection: A review and benchmark

K Fu, Y Jiang, GP Ji, T Zhou, Q Zhao… - Computational Visual …, 2022 - Springer
Salient object detection (SOD) is a long-standing research topic in computer vision with
increasing interest in the past decade. Since light fields record comprehensive information of …

Everything at once – multi-modal fusion transformer for video retrieval

N Shvetsova, B Chen… - Proceedings of the …, 2022 - openaccess.thecvf.com
Multi-modal learning from video data has seen increased attention recently as it allows
training of semantically meaningful embeddings without human annotation, enabling tasks …

Hierarchical multimodal transformer to summarize videos

B Zhao, M Gong, X Li - Neurocomputing, 2022 - Elsevier
Although video summarization has achieved tremendous success benefiting from Recurrent
Neural Networks (RNN), RNN-based methods neglect the global dependencies and multi …

ViNet: Pushing the limits of visual modality for audio-visual saliency prediction

S Jain, P Yarlagadda, S Jyoti, S Karthik… - 2021 IEEE/RSJ …, 2021 - ieeexplore.ieee.org
We propose the ViNet architecture for audio-visual saliency prediction. ViNet is a fully
convolutional encoder-decoder architecture. The encoder uses visual features from a …

CASP-Net: Rethinking video saliency prediction from an audio-visual consistency perceptual perspective

J Xiong, G Wang, P Zhang, W Huang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the
selective attention mechanism of the human brain. By focusing on the benefits of joint auditory …

In the eye of transformer: Global–local correlation for egocentric gaze estimation and beyond

B Lai, M Liu, F Ryan, JM Rehg - International Journal of Computer Vision, 2024 - Springer
Predicting a human's gaze from egocentric videos plays a critical role in understanding
human intention in daily activities. In this paper, we present the first transformer-based model …

From semantic categories to fixations: A novel weakly-supervised visual-auditory saliency detection approach

G Wang, C Chen, DP Fan, A Hao… - Proceedings of the …, 2021 - openaccess.thecvf.com
Thanks to the rapid advances in the deep learning techniques and the wide availability of
large-scale training sets, the performances of video saliency detection models have been …

Repetitive activity counting by sight and sound

Y Zhang, L Shao, CGM Snoek - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
This paper strives for repetitive activity counting in videos. Different from existing works,
which all analyze the visual video content only, we incorporate for the first time the …

Beyond image to depth: Improving depth prediction using echoes

KK Parida, S Srivastava… - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
We address the problem of estimating depth with multi-modal audio-visual data. Inspired by
the ability of animals, such as bats and dolphins, to infer distance of objects with …