Deep learning approaches have recently raised the bar in many fields, from Natural Language Processing to Computer Vision, by leveraging large amounts of data. However …
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling …
We present GLIPv2, a grounded vision-language (VL) understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and VL understanding …
Neural compression is the application of neural networks and other machine learning methods to data compression. Recent advances in statistical machine learning have opened …
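The snippet above frames neural compression as learned transforms plus an entropy model. A minimal sketch of the standard rate-distortion training objective (rate R plus a weighted distortion D) is below; the linear transforms, the histogram-based rate proxy, and all names are illustrative stand-ins, not any particular codec's implementation:

```python
import numpy as np

# Hedged sketch of the core neural-compression objective: a learned
# autoencoder is trained to minimize rate + lambda * distortion.
# Real codecs use nonlinear transforms and a learned entropy model.

rng = np.random.default_rng(0)

def encode(x, W):
    """Toy linear 'analysis transform' followed by rounding (quantization)."""
    return np.round(x @ W)

def decode(y, W):
    """Toy linear 'synthesis transform' (pseudo-inverse of the encoder)."""
    return y @ np.linalg.pinv(W)

def rate(y):
    """Crude rate proxy: entropy (in bits) of the quantized symbols
    under an empirical histogram."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

x = rng.normal(size=(100, 8))        # batch of 8-dim "signals"
W = rng.normal(size=(8, 4)) * 0.5    # 8 -> 4 bottleneck
y = encode(x, W)
x_hat = decode(y, W)

lam = 0.01
distortion = np.mean((x - x_hat) ** 2)   # MSE distortion D
loss = rate(y) + lam * distortion        # rate-distortion objective R + lam*D
print(round(float(loss), 3))
```

In practice the rounding step is replaced by a differentiable surrogate (e.g., additive uniform noise) during training so the whole objective can be optimized by gradient descent.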
Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as …
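One common way to realize the idea in this snippet is to replace a detector's fixed classifier weights with text embeddings of class names, so the vocabulary becomes an open list of strings. The sketch below illustrates that with a hypothetical text encoder (a deterministic random projection standing in for a real one) and a synthetic region feature:

```python
import zlib
import numpy as np

# Hedged sketch: classify a detected region by cosine similarity to
# text embeddings of class names. The embedding function is a stand-in
# for a real text encoder; vocabulary size is unconstrained.

rng = np.random.default_rng(0)
DIM = 16

def embed_text(name):
    """Hypothetical text encoder: deterministic unit vector per name."""
    r = np.random.default_rng(zlib.crc32(name.encode()))
    v = r.normal(size=DIM)
    return v / np.linalg.norm(v)

# The vocabulary is just a list of strings and can grow freely.
vocab = ["cat", "dog", "pelican", "unicycle"]
W_text = np.stack([embed_text(c) for c in vocab])   # (num_classes, DIM)

# A detector would produce one feature per proposed region; here the
# feature is synthesized near the "pelican" embedding for illustration.
region_feat = W_text[2] + 0.05 * rng.normal(size=DIM)
region_feat /= np.linalg.norm(region_feat)

logits = W_text @ region_feat        # cosine similarity per class name
print(vocab[int(np.argmax(logits))])
```

Because the classifier is now text-conditioned, adding a class at inference time costs one embedding call rather than any detector retraining.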
This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object …
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings …
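The contrastive objective mentioned in this snippet can be sketched compactly: for a batch of N matched (image, text) pairs, maximize similarity on the diagonal of the N x N image-text similarity matrix and minimize it elsewhere, in both directions. The encoders below are placeholder vectors, not real CLIP models:

```python
import numpy as np

# Hedged sketch of a CLIP-style symmetric contrastive (InfoNCE) loss.
# Embeddings are synthetic; matched pairs are made deliberately similar.

rng = np.random.default_rng(0)
N, DIM = 4, 32

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Pretend the i-th image and i-th caption encode to nearby vectors.
shared = rng.normal(size=(N, DIM))
img_emb = normalize(shared + 0.1 * rng.normal(size=(N, DIM)))
txt_emb = normalize(shared + 0.1 * rng.normal(size=(N, DIM)))

temperature = 0.07
logits = img_emb @ txt_emb.T / temperature      # (N, N) similarity matrix

def cross_entropy(logits, targets):
    """Row-wise softmax cross-entropy against integer targets."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

targets = np.arange(N)                          # pair i matches pair i
loss_i2t = cross_entropy(logits, targets)       # image -> text direction
loss_t2i = cross_entropy(logits.T, targets)     # text -> image direction
loss = 0.5 * (loss_i2t + loss_t2i)              # symmetric contrastive loss
print(round(float(loss), 4))
```

Zero-shot classification then falls out of the same similarity computation: embed each class name as a caption and pick the class whose text embedding is closest to the image embedding.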
Automated visual understanding of our diverse and open world demands that computer vision models generalize well with minimal customization for specific tasks, similar to human …
Y Du, F Wei, Z Zhang, M Shi… - Proceedings of the …, 2022 - openaccess.thecvf.com
Recently, vision-language pre-training has shown great potential for open-vocabulary object detection, where detectors trained on base classes are adapted to detect novel classes …