UNIMO: Towards unified-modal understanding and generation via cross-modal contrastive learning

W Li, C Gao, G Niu, X Xiao, H Liu, J Liu, H Wu… - arXiv preprint arXiv …, 2020 - arxiv.org
Existing pre-training methods focus on either single-modal or multi-modal tasks and
cannot effectively adapt to each other. They can only utilize single-modal data (i.e., text or …

A simple recipe for contrastively pre-training video-first encoders beyond 16 frames

P Papalampidi, S Koppula, S Pathak… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding long real-world videos requires modeling long-range visual
dependencies. To this end, we explore video-first architectures, building on the common …

Space-time crop & attend: Improving cross-modal video representation learning

M Patrick, PY Huang, I Misra, F Metze… - Proceedings of the …, 2021 - openaccess.thecvf.com
The quality of the image representations obtained from self-supervised learning depends
strongly on the type of data augmentations used in the learning formulation. Recent papers …

Enabling multimodal generation on CLIP via vision-language knowledge distillation

W Dai, L Hou, L Shang, X Jiang, Q Liu… - arXiv preprint arXiv …, 2022 - arxiv.org
Recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g.,
CLIP) with a tremendous amount of image-text pair data has shown its superiority on …

MELTR: Meta loss transformer for learning to fine-tune video foundation models

D Ko, J Choi, HK Choi, KW On… - Proceedings of the …, 2023 - openaccess.thecvf.com
Foundation models have shown outstanding performance and generalization capabilities
across domains. Since most studies on foundation models focus mainly on the pretraining …

Multimodal pretraining for dense video captioning

G Huang, B Pang, Z Zhu, C Rivera, R Soricut - arXiv preprint arXiv …, 2020 - arxiv.org
Learning specific hands-on skills such as cooking, car maintenance, and home repairs
increasingly happens via instructional videos. The user experience with such videos is …

Iterative answer prediction with pointer-augmented multimodal transformers for TextVQA

R Hu, A Singh, T Darrell… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Many visual scenes contain text that carries crucial information, and it is thus essential to
understand text in images for downstream reasoning tasks. For example, a deep water label …

Visual commonsense R-CNN

T Wang, J Huang, H Zhang… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
We present a novel unsupervised feature representation learning method, Visual
Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an …

Long-term feature banks for detailed video understanding

CY Wu, C Feichtenhofer, H Fan, K He… - Proceedings of the …, 2019 - openaccess.thecvf.com
To understand the world, we humans constantly need to relate the present to the past, and
put events in context. In this paper, we enable existing video models to do the same. We …

Watch, listen and tell: Multi-modal weakly supervised dense event captioning

T Rahman, B Xu, L Sigal - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Multi-modal learning, particularly between imaging and linguistic modalities, has made
remarkable strides in many high-level fundamental visual understanding problems, ranging …