GTA: Global temporal attention for video action understanding

Attention mechanisms in computer vision: A survey

MH Guo, TX Xu, JJ Liu, ZN Liu, PT Jiang, TJ Mu… - Computational visual …, 2022 - Springer

Humans can naturally and effectively find salient regions in complex scenes. Motivated by
this observation, attention mechanisms were introduced into computer vision with the aim of …

Cited by 1868 · Related articles · All 8 versions

ASM-Loc: Action-aware segment modeling for weakly-supervised temporal action localization

B He, X Yang, L Kang, Z Cheng… - Proceedings of the …, 2022 - openaccess.thecvf.com

Weakly-supervised temporal action localization aims to recognize and localize action
segments in untrimmed videos given only video-level action labels for training. Without the …

Cited by 108 · Related articles · All 5 versions

Align and attend: Multimodal summarization with dual contrastive losses

B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com

The goal of multimodal summarization is to extract the most important information from
different modalities to form summaries. Unlike unimodal summarization, the multimodal …

Cited by 59 · Related articles · All 7 versions

Chop & learn: Recognizing and generating object-state compositions

N Saini, H Wang, A Swaminathan… - Proceedings of the …, 2023 - openaccess.thecvf.com

Recognizing and generating object-state compositions has been a challenging task,
especially when generalizing to unseen compositions. In this paper, we study the task of …

Cited by 14 · Related articles · All 6 versions

Towards scalable neural representation for diverse videos

B He, X Yang, H Wang, Z Wu, H Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com

Implicit neural representations (INR) have gained increasing attention in representing 3D
scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV) …

Cited by 35 · Related articles · All 5 versions

OmniVid: A generative framework for universal video understanding

J Wang, D Chen, C Luo, B He, L Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com

The core of video understanding tasks, such as recognition, captioning, and tracking, is to
automatically detect objects or actions in a video and analyze their temporal evolution …

Cited by 16 · Related articles · All 3 versions

Efficient video transformers with spatial-temporal token selection

J Wang, X Yang, H Li, L Liu, Z Wu, YG Jiang - European Conference on …, 2022 - Springer

Video transformers have achieved impressive results on major video recognition
benchmarks, which however suffer from high computational cost. In this paper, we present …

Cited by 78 · Related articles · All 5 versions

MetaGait: Learning to learn an omni sample adaptive representation for gait recognition

H Dou, P Zhang, W Su, Y Yu, X Li - European Conference on Computer …, 2022 - Springer

Gait recognition, which aims at identifying individuals by their walking patterns, has recently
drawn increasing research attention. However, gait recognition still suffers from the conflicts …

Cited by 38 · Related articles · All 6 versions

Improving RGB-D salient object detection via modality-aware decoder

M Song, W Song, G Yang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org

Most existing RGB-D salient object detection (SOD) methods are primarily focusing on
cross-modal and cross-level saliency fusion, which has been proved to be efficient and effective …

Cited by 39 · Related articles · All 5 versions

Efficient spatio-temporal modeling methods for real-time violence recognition

MS Kang, RH Park, HM Park - IEEE Access, 2021 - ieeexplore.ieee.org

Violence recognition is challenging since recognition must be performed on videos acquired
by a lot of surveillance cameras at any time or place. It should make reliable detections in …

Cited by 69 · Related articles · All 3 versions
