EVA: Exploring the limits of masked visual representation learning at scale

Y Fang, W Wang, B Xie, Q Sun, L Wu… - Proceedings of the …, 2023 - openaccess.thecvf.com
We launch EVA, a vision-centric foundation model to explore the limits of visual
representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …

DiffusionDet: Diffusion model for object detection

S Chen, P Sun, Y Song, P Luo - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We propose DiffusionDet, a new framework that formulates object detection as a denoising
diffusion process from noisy boxes to object boxes. During the training stage, object boxes …

Video-ChatGPT: Towards detailed video understanding via large vision and language models

M Maaz, H Rasheed, S Khan, FS Khan - arXiv preprint arXiv:2306.05424, 2023 - arxiv.org
Conversation agents fueled by Large Language Models (LLMs) are providing a new way to
interact with visual data. While there have been initial attempts for image-based …

EVA-02: A visual representation for neon genesis

Y Fang, Q Sun, X Wang, T Huang, X Wang… - Image and Vision …, 2024 - Elsevier
We launch EVA-02, a next-generation Transformer-based visual representation pre-trained
to reconstruct strong and robust language-aligned vision features via masked image …

DETRs with collaborative hybrid assignments training

Z Zong, G Song, Y Liu - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
In this paper, we provide the observation that too few queries assigned as positive samples
in DETR with one-to-one set matching leads to sparse supervision on the encoder's output …

Exploring plain vision transformer backbones for object detection

Y Li, H Mao, R Girshick, K He - European conference on computer vision, 2022 - Springer
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for
object detection. This design enables the original ViT architecture to be fine-tuned for object …

Simple open-vocabulary object detection

M Minderer, A Gritsenko, A Stone, M Neumann… - … on Computer Vision, 2022 - Springer
Combining simple architectures with large-scale pre-training has led to massive
improvements in image classification. For object detection, pre-training and scaling …

Detecting twenty-thousand classes using image-level supervision

X Zhou, R Girdhar, A Joulin, P Krähenbühl… - European Conference on …, 2022 - Springer
Current object detectors are limited in vocabulary size due to the small scale of detection
datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as …

RegionCLIP: Region-based language-image pretraining

Y Zhong, J Yang, P Zhang, C Li… - Proceedings of the …, 2022 - openaccess.thecvf.com
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved
impressive results on image classification in both zero-shot and transfer learning settings …

TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios

X Zhu, S Lyu, X Wang, Q Zhao - Proceedings of the IEEE …, 2021 - openaccess.thecvf.com
Object detection on drone-captured scenarios is a recent popular task. As drones always
navigate in different altitudes, the object scale varies violently, which burdens the …