UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning

W Li, C Gao, G Niu, X Xiao, H Liu, J Liu, H Wu… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing pre-training methods focus on either single-modal or multi-modal tasks and
cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or …

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling long-range visual
dependencies. To this end, we explore video-first architectures, building on the common …

Space-time crop & attend: Improving cross-modal video representation learning

M Patrick, PY Huang, I Misra, F Metze… - Proceedings of the …, 2021 - openaccess.thecvf.com
The quality of the image representations obtained from self-supervised learning depends
strongly on the type of data augmentations used in the learning formulation. Recent papers …

Enabling multimodal generation on CLIP via vision-language knowledge distillation

W Dai, L Hou, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g.,
CLIP) with a tremendous amount of image-text pair data has shown its superiority on …

MELTR: Meta loss transformer for learning to fine-tune video foundation models

D Ko, J Choi, HK Choi, KW On… - Proceedings of the …, 2023 - openaccess.thecvf.com
Foundation models have shown outstanding performance and generalization capabilities
across domains. Since most studies on foundation models focus mainly on the pretraining …

Multimodal pretraining for dense video captioning

G Huang, B Pang, Z Zhu, C Rivera, R Soricut - arXiv preprint arXiv …, 2020 - arxiv.org
Learning specific hands-on skills such as cooking, car maintenance, and home repairs
increasingly happens via instructional videos. The user experience with such videos is …

Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA

R Hu, A Singh, T Darrell… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Many visual scenes contain text that carries crucial information, and it is thus essential to
understand text in images for downstream reasoning tasks. For example, a deep water label …

Visual commonsense R-CNN

T Wang, J Huang, H Zhang… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
We present a novel unsupervised feature representation learning method, Visual
Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an …

Long-term feature banks for detailed video understanding

CY Wu, C Feichtenhofer, H Fan, K He… - Proceedings of the …, 2019 - openaccess.thecvf.com
To understand the world, we humans constantly need to relate the present to the past, and
put events in context. In this paper, we enable existing video models to do the same. We …

Watch, listen and tell: Multi-modal weakly supervised dense event captioning

T Rahman, B Xu, L Sigal - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Multi-modal learning, particularly between imaging and linguistic modalities, has made
remarkable strides in many high-level fundamental visual understanding problems, ranging …