Token Turing machines

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com

Understanding long real-world videos requires modeling of long-range visual
dependencies. To this end, we explore video-first architectures building on the common …

Cited by 8 (3 versions)

Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities

AJ Piergiovanni, I Noble, D Kim… - Proceedings of the …, 2024 - openaccess.thecvf.com

One of the main challenges of multimodal learning is the need to combine heterogeneous
modalities (e.g., video, audio, text). For example, video and audio are obtained at much higher …

Cited by 7 (3 versions)

Streaming dense video captioning

X Zhou, A Arnab, S Buch, S Yan… - Proceedings of the …, 2024 - openaccess.thecvf.com

An ideal model for dense video captioning--predicting captions localized temporally in a
video--should be able to handle long input videos, predict rich, detailed textual descriptions …

Cited by 4 (3 versions)

Memory consolidation enables long-context video understanding

I Balažević, Y Shi, P Papalampidi, R Chaabouni… - arXiv preprint arXiv …, 2024 - arxiv.org

Most transformer-based video encoders are limited to short temporal contexts due to their
quadratic complexity. While various attempts have been made to extend this context, this …

Cited by 7 (2 versions)

Generalization to new sequential decision making tasks with in-context learning

SC Raparthy, E Hambro, R Kirk, M Henaff… - arXiv preprint arXiv …, 2023 - arxiv.org

Training autonomous agents that can learn new tasks from only a handful of demonstrations
is a long-standing problem in machine learning. Recently, transformers have been shown to …

Cited by 5 (3 versions)

VicTR: Video-conditioned text representations for activity recognition

K Kahatapitiya, A Arnab, A Nagrani… - Proceedings of the …, 2024 - openaccess.thecvf.com

Vision-Language models (VLMs) have excelled in the image domain, especially in
zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text …

Cited by 8 (4 versions)

[BOOK][B] From deep learning to rational machines: What the history of philosophy can teach us about the future of artificial intelligence

CJ Buckner - 2024 - books.google.com

This book provides a framework for thinking about foundational philosophical questions
surrounding machine learning as an approach to artificial intelligence. Specifically, it links …

Cited by 11 (6 versions)

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

Active vision reinforcement learning under limited visual observability

MCMNET: Multi-Scale Context Modeling Network for Temporal Action Detection

TTM-RE: Memory-Augmented Document-Level Relation Extraction
