A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end, we explore video-first architectures, building on the common …

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

AJ Piergiovanni, I Noble, D Kim… - Proceedings of the …, 2024 - openaccess.thecvf.com
One of the main challenges of multimodal learning is the need to combine heterogeneous
modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher …

Streaming dense video captioning

X Zhou, A Arnab, S Buch, S Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com
An ideal model for dense video captioning (predicting captions localized temporally in a
video) should be able to handle long input videos, predict rich, detailed textual descriptions …

Memory consolidation enables long-context video understanding

I Balažević, Y Shi, P Papalampidi, R Chaabouni… - arXiv preprint arXiv …, 2024 - arxiv.org
Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …

Generalization to new sequential decision making tasks with in-context learning

SC Raparthy, E Hambro, R Kirk, M Henaff… - arXiv preprint arXiv …, 2023 - arxiv.org
Training autonomous agents that can learn new tasks from only a handful of demonstrations
is a long-standing problem in machine learning. Recently, transformers have been shown to …

Victr: Video-conditioned text representations for activity recognition

K Kahatapitiya, A Arnab, A Nagrani… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-Language models (VLMs) have excelled in the image domain, especially in
zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text …

[BOOK][B] From deep learning to rational machines: What the history of philosophy can teach us about the future of artificial intelligence

CJ Buckner - 2024 - books.google.com
This book provides a framework for thinking about foundational philosophical questions
surrounding machine learning as an approach to artificial intelligence. Specifically, it links …

Active vision reinforcement learning under limited visual observability

J Shang, MS Ryoo - Advances in Neural Information …, 2024 - proceedings.neurips.cc
In this work, we investigate Active Vision Reinforcement Learning (ActiveVision-RL), where
an embodied agent simultaneously learns action policy for the task while also controlling its …

MCMNET: Multi-Scale Context Modeling Network for Temporal Action Detection

H Zhang, F Zhou, C Ma, D Wang, W Zhang - Sensors, 2023 - mdpi.com
Temporal action detection is a very important and challenging task in the field of video
understanding, especially for datasets with significant differences in action duration. The …

TTM-RE: Memory-Augmented Document-Level Relation Extraction

C Gao, X Wang, J Sun - arXiv preprint arXiv:2406.05906, 2024 - arxiv.org
Document-level relation extraction aims to categorize the association between any two
entities within a document. We find that previous methods for document-level relation …