Grounding DINO 1.5: Advance the "Edge" of Open-Set Object Detection

T Ren, Q Jiang, S Liu, Z Zeng, W Liu, H Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection
models developed by IDEA Research, which aims to advance the "Edge" of open-set object …

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

Z Zhang, Y Ma, E Zhang, X Bai - arXiv preprint arXiv:2403.14598, 2024 - arxiv.org
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the
segmentation task challenges. To overcome the limitation of the LMM being limited to textual …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

Real-Time Camera Operator Segmentation with YOLOv8 in Football Video Broadcasts

S Postupaiev, R Damaševičius, R Maskeliūnas - AI, 2024 - mdpi.com
Using instance segmentation and video inpainting provides a significant leap in real-time
football video broadcast enhancements by removing potential visual distractions, such as an …

Panoptic Water Surface Visual Perception for USVs using Monocular Camera Sensor

H Xu, X Zhang, J He, Z Geng, Y Yu… - IEEE Sensors …, 2024 - ieeexplore.ieee.org
In recent years, the significance of unmanned surface vehicles (USVs) has grown
substantially across a wide range of applications. Monocular cameras, as the most common …

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

W Ma, G Zeng, G Zhang, Q Liu, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
A vision model with general-purpose object-level 3D understanding should be capable of
inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location …

UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

J Wu, Y Jiang, B Yan, H Lu, Z Yuan, P Luo - arXiv preprint arXiv …, 2023 - arxiv.org
The reference-based object segmentation tasks, namely referring image segmentation
(RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and …

1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

M Gao, J Luo, J Yang, J Han, F Zheng - arXiv preprint arXiv:2406.07043, 2024 - arxiv.org
Motion Expression guided Video Segmentation (MeViS), as an emerging task, poses many
new challenges to the field of referring video object segmentation (RVOS). In this technical …

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org
Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …

MonoDETRNext: Next-generation Accurate and Efficient Monocular 3D Object Detection Method

P Liao, F Yang, D Wu, L Bo - arXiv preprint arXiv:2405.15176, 2024 - arxiv.org
Monocular vision-based 3D object detection is crucial in various sectors, yet existing
methods face significant challenges in terms of accuracy and computational efficiency …