Connecting Vision and Language plays an essential role in Generative Intelligence. For this reason, large research efforts have been devoted to image captioning, ie describing images …
Image captioning is a fundamental task in vision-language understanding, where the model predicts a textual informative caption to a given input image. In this paper, we present a …
Abstract We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models …
ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. ImageNet-21K dataset, which is bigger and more diverse, is used …
H Chen, Z Qi, Z Shi - IEEE Transactions on Geoscience and …, 2021 - ieeexplore.ieee.org
Modern change detection (CD) has achieved remarkable success by the powerful discriminative ability of deep convolutions. However, high-resolution remote sensing CD …
This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection …
V Melnychuk, D Frauen… - … Conference on Machine …, 2022 - proceedings.mlr.press
Estimating counterfactual outcomes over time from observational data is relevant for many applications (eg, personalized medicine). Yet, state-of-the-art methods build upon simple …
Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by Detection Transformer, which …
Contrastive learning based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation …