Understanding long real-world videos requires modeling of long-range visual dependencies. To this end we explore video-first architectures building on the common …
The quality of the image representations obtained from self-supervised learning depends strongly on the type of data augmentations used in the learning formulation. Recent papers …
The recent large-scale vision-language pre-training (VLP) of dual-stream architectures (e.g., CLIP) with a tremendous amount of image-text pair data has shown its superiority on …
Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining …
Learning specific hands-on skills such as cooking, car maintenance, and home repairs increasingly happens via instructional videos. The user experience with such videos is …
R Hu, A Singh, T Darrell… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
Many visual scenes contain text that carries crucial information, and it is thus essential to understand text in images for downstream reasoning tasks. For example, a deep water label …
T Wang, J Huang, H Zhang… - Proceedings of the IEEE …, 2020 - openaccess.thecvf.com
We present a novel unsupervised feature representation learning method, Visual Commonsense Region-based Convolutional Neural Network (VC R-CNN), to serve as an …
To understand the world, we humans constantly need to relate the present to the past, and put events in context. In this paper, we enable existing video models to do the same. We …
T Rahman, B Xu, L Sigal - Proceedings of the IEEE/CVF …, 2019 - openaccess.thecvf.com
Multi-modal learning, particularly across imaging and linguistic modalities, has made remarkable strides in many high-level fundamental visual understanding problems, ranging …