Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. Without the …
B He, J Wang, J Qiu, T Bui… - Proceedings of the …, 2023 - openaccess.thecvf.com
The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal …
Recognizing and generating object-state compositions has been a challenging task, especially when generalizing to unseen compositions. In this paper, we study the task of …
Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (eg, NeRV, E-NeRV) …
J Wang, D Chen, C Luo, B He, L Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
The core of video understanding tasks such as recognition captioning and tracking is to automatically detect objects or actions in a video and analyze their temporal evolution …
Video transformers have achieved impressive results on major video recognition benchmarks, which however suffer from high computational cost. In this paper, we present …
H Dou, P Zhang, W Su, Y Yu, X Li - European Conference on Computer …, 2022 - Springer
Gait recognition, which aims at identifying individuals by their walking patterns, has recently drawn increasing research attention. However, gait recognition still suffers from the conflicts …
M Song, W Song, G Yang… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Most existing RGB-D salient object detection (SOD) methods are primarily focusing on cross- modal and cross-level saliency fusion, which has been proved to be efficient and effective …
Violence recognition is challenging since recognition must be performed on videos acquired by a lot of surveillance cameras at any time or place. It should make reliable detections in …