Y Tian, D Su, S Lauria, X Liu - Neurocomputing, 2022 - Elsevier
The loss function, also known as the cost function, is used to train a neural network or other machine learning model. Over the past decade, researchers have designed many loss …
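As a minimal illustration of the concept this snippet describes, the sketch below defines one common loss function, mean squared error (MSE); the function name and example values are illustrative, not taken from the paper:

```python
# A minimal sketch of a loss (cost) function: mean squared error (MSE),
# a common choice when training regression models.
def mse_loss(predictions, targets):
    """Average squared difference between predictions and ground-truth targets."""
    assert len(predictions) == len(targets) and len(targets) > 0
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# During training, model parameters are updated to reduce this value.
print(mse_loss([2.0, 4.0], [1.0, 5.0]))  # → 1.0
```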
In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary …
A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves …
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early …
We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained …
Abstract This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet …
We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over …
In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by …
Current object detectors are limited in vocabulary size due to the small scale of detection datasets. Image classifiers, on the other hand, reason about much larger vocabularies, as …