Align and attend: Multimodal summarization with dual contrastive losses

N Saini, H Wang, A Swaminathan… - Proceedings of the …, 2023 - openaccess.thecvf.com

Recognizing and generating object-state compositions has been a challenging task,
especially when generalizing to unseen compositions. In this paper, we study the task of …

Cited by 10 Related articles All 6 versions

[PDF] thecvf.com

Towards scalable neural representation for diverse videos

B He, X Yang, H Wang, Z Wu, H Chen… - Proceedings of the …, 2023 - openaccess.thecvf.com

Implicit neural representations (INR) have gained increasing attention in representing 3D
scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV) …

Cited by 20 Related articles All 5 versions

[PDF] thecvf.com

Omnivid: A generative framework for universal video understanding

J Wang, D Chen, C Luo, B He, L Yuan… - Proceedings of the …, 2024 - openaccess.thecvf.com

The core of video understanding tasks, such as recognition, captioning, and tracking, is to
automatically detect objects or actions in a video and analyze their temporal evolution …

Cited by 4 Related articles All 3 versions

[PDF] springer.com

An outlook into the future of egocentric vision

C Plizzari, G Goletto, A Furnari, S Bansal… - International Journal of …, 2024 - Springer

What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …

Cited by 14 Related articles All 7 versions

Multimodal dual-embedding networks for malware open-set recognition

J Guo, H Wang, Y Xu, W Xu, Y Zhan… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org

Malware open-set recognition (MOSR) is an emerging research domain that aims at jointly
classifying malware samples from known families and detecting the ones from novel …

Cited by 4 Related articles All 3 versions

[PDF] neurips.cc

Mr. HiSum: a large-scale dataset for video highlight detection and summarization

J Sul, J Han, J Lee - Advances in Neural Information …, 2024 - proceedings.neurips.cc

Video highlight detection is a task to automatically select the most engaging moments from a
long video. This problem is highly challenging since it aims to learn a general way of finding …

Cited by 2 Related articles All 3 versions

Multi-task hierarchical heterogeneous fusion framework for multimodal summarization

L Zhang, X Zhang, L Han, Z Yu, Y Liu, Z Li - Information Processing & …, 2024 - Elsevier

With the rise of multimedia content on the internet, Multimodal Summarization has become a
challenging task to help individuals grasp vital information fast. However, previous methods …

Cited by 2 Related articles

[PDF] thecvf.com

CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning

L Chen, X Wang, J Lu, S Lin… - Proceedings of the …, 2024 - openaccess.thecvf.com

3D Scene Graph Generation (3DSGG) aims to classify objects and their predicates
within 3D point cloud scenes. However, current 3DSGG methods struggle with two main …

[PDF] arxiv.org

V2xum-llm: Cross-modal video summarization with temporal prompt instruction tuning

H Hua, Y Tang, C Xu, J Luo - arXiv preprint arXiv:2404.12353, 2024 - arxiv.org

Video summarization aims to create short, accurate, and cohesive summaries of longer
videos. Despite the existence of various video summarization datasets, a notable limitation …

Cited by 2 Related articles All 2 versions

[PDF] thecvf.com

Scaling Up Video Summarization Pretraining with Large Language Models

DM Argaw, S Yoon, FC Heilbron… - Proceedings of the …, 2024 - openaccess.thecvf.com

Long-form video content constitutes a significant portion of internet traffic, making automated
video summarization an essential research problem. However, existing video summarization …
