Chop & learn: Recognizing and generating object-state compositions

N Saini, H Wang, A Swaminathan… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recognizing and generating object-state compositions has been a challenging task,
especially when generalizing to unseen compositions. In this paper, we study the task of …

Towards scalable neural representation for diverse videos

B He, X Yang, H Wang, Z Wu, H Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com
Implicit neural representations (INR) have gained increasing attention in representing 3D
scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV) …

OmniVid: A generative framework for universal video understanding

J Wang, D Chen, C Luo, B He, L Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com
The core of video understanding tasks such as recognition, captioning, and tracking is to
automatically detect objects or actions in a video and analyze their temporal evolution …

An outlook into the future of egocentric vision

C Plizzari, G Goletto, A Furnari, S Bansal… - International Journal of …, 2024 - Springer
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …

Multimodal dual-embedding networks for malware open-set recognition

J Guo, H Wang, Y Xu, W Xu, Y Zhan… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
Malware open-set recognition (MOSR) is an emerging research domain that aims at jointly
classifying malware samples from known families and detecting the ones from novel …

Mr. HiSum: a large-scale dataset for video highlight detection and summarization

J Sul, J Han, J Lee - Advances in Neural Information …, 2024 - proceedings.neurips.cc
Video highlight detection is a task to automatically select the most engaging moments from a
long video. This problem is highly challenging since it aims to learn a general way of finding …

Multi-task hierarchical heterogeneous fusion framework for multimodal summarization

L Zhang, X Zhang, L Han, Z Yu, Y Liu, Z Li - Information Processing & …, 2024 - Elsevier
With the rise of multimedia content on the internet, Multimodal Summarization has become a
challenging task to help individuals grasp vital information fast. However, previous methods …

CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning

L Chen, X Wang, J Lu, S Lin… - Proceedings of the …, 2024 - openaccess.thecvf.com
3D Scene Graph Generation (3DSGG) aims to classify objects and their predicates
within 3D point cloud scenes. However, current 3DSGG methods struggle with two main …

V2Xum-LLM: Cross-modal video summarization with temporal prompt instruction tuning

H Hua, Y Tang, C Xu, J Luo - arXiv preprint arXiv:2404.12353, 2024 - arxiv.org
Video summarization aims to create short, accurate, and cohesive summaries of longer
videos. Despite the existence of various video summarization datasets, a notable limitation …

Scaling Up Video Summarization Pretraining with Large Language Models

DM Argaw, S Yoon, FC Heilbron… - Proceedings of the …, 2024 - openaccess.thecvf.com
Long-form video content constitutes a significant portion of internet traffic, making automated
video summarization an essential research problem. However, existing video summarization …