Self-chained image-language model for video localization and question answering

S Yu, J Cho, P Yadav, M Bansal - Advances in Neural …, 2024 - proceedings.neurips.cc
Recent studies have shown promising results on utilizing large pre-trained image-language
models for video question answering. While these image-language models can efficiently …

Learning to predict activity progress by self-supervised video alignment

G Donahue, E Elhamifar - … of the IEEE/CVF Conference on …, 2024 - openaccess.thecvf.com
In this paper we tackle the problem of self-supervised video alignment and activity progress
prediction using in-the-wild videos. Our proposed self-supervised representation learning …

An outlook into the future of egocentric vision

C Plizzari, G Goletto, A Furnari, S Bansal… - International Journal of …, 2024 - Springer
What will the future be? We wonder! In this survey, we explore the gap between current
research in egocentric vision and the ever-anticipated future, where wearable computing …

Online human motion analysis in industrial context: A review

T Benmessabih, R Slama, V Havard… - Engineering Applications of …, 2024 - Elsevier
Human motion analysis plays a crucial role in Industry 4.0 and, more recently, in Industry 5.0,
where human-centered applications are becoming increasingly important, demonstrating its …

VideoEval: Comprehensive benchmark suite for low-cost evaluation of video foundation model

X Li, Z Huang, J Wang, K Li, L Wang - arXiv preprint arXiv:2407.06491, 2024 - arxiv.org
With the growth of high-quality data and advancement in visual pre-training paradigms,
Video Foundation Models (VFMs) have made significant progress recently, demonstrating …

Beyond Accuracy: Statistical Measures and Benchmark for Evaluation of Representation from Self-Supervised Learning

J Wu, S Mo, S Atito, J Kittler, Z Feng… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, self-supervised metric learning has raised attention for its potential to learn a
generic distance function. It overcomes the limitations of the conventional supervised one, e.g. …

TransferAttn: Transferable-guided Attention Is All You Need for Video Domain Adaptation

A Sacilotti, SF Santos, N Sebe, J Almeida - arXiv preprint arXiv …, 2024 - arxiv.org
Unsupervised domain adaptation (UDA) in videos is a challenging task that remains not well
explored compared to image-based UDA techniques. Although vision transformers (ViT) …

Text-Enhanced Zero-Shot Action Recognition: A training-free approach

M Bosetti, S Zhang, B Liberatori, G Zara, E Ricci… - arXiv preprint arXiv …, 2024 - arxiv.org
Vision-language models (VLMs) have demonstrated remarkable performance across
various visual tasks, leveraging joint learning of visual and textual representations. While …

[CITATION][C] React to this! How humans challenge interactive agents using nonverbal behaviors

C Zhang - 2024 - Simon Fraser University