General object foundation model for images and videos at scale

T Ren, Q Jiang, S Liu, Z Zeng, W Liu, H Gao… - arXiv preprint arXiv …, 2024 - arxiv.org

This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection
models developed by IDEA Research, which aims to advance the "Edge" of open-set object …

Cited by 1 · Related articles · All 2 versions

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model

Z Zhang, Y Ma, E Zhang, X Bai - arXiv preprint arXiv:2403.14598, 2024 - arxiv.org

PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the
segmentation task challenges. To overcome the limitation of the LMM being limited to textual …

Cited by 4 · Related articles · All 2 versions

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org

We present VisionLLM v2, an end-to-end generalist multimodal large model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

Real-Time Camera Operator Segmentation with YOLOv8 in Football Video Broadcasts

S Postupaiev, R Damaševičius, R Maskeliūnas - AI, 2024 - mdpi.com

Using instance segmentation and video inpainting provides a significant leap in real-time
football video broadcast enhancements by removing potential visual distractions, such as an …

Cited by 1 · Related articles

Panoptic Water Surface Visual Perception for USVs using Monocular Camera Sensor

H Xu, X Zhang, J He, Z Geng, Y Yu… - IEEE Sensors …, 2024 - ieeexplore.ieee.org

In recent years, the significance of unmanned surface vehicles (USVs) has grown
substantially across a wide range of applications. Monocular cameras, as the most common …

ImageNet3D: Towards General-Purpose Object-Level 3D Understanding

W Ma, G Zeng, G Zhang, Q Liu, L Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org

A vision model with general-purpose object-level 3D understanding should be capable of
inferring both 2D (e.g., class name and bounding box) and 3D information (e.g., 3D location …

UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

J Wu, Y Jiang, B Yan, H Lu, Z Yuan, P Luo - arXiv preprint arXiv …, 2023 - arxiv.org

The reference-based object segmentation tasks, namely referring image segmentation
(RIS), few-shot image segmentation (FSS), referring video object segmentation (RVOS), and …

Cited by 1 · Related articles · All 2 versions

1st Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation

M Gao, J Luo, J Yang, J Han, F Zheng - arXiv preprint arXiv:2406.07043, 2024 - arxiv.org

Motion Expression guided Video Segmentation (MeViS), as an emerging task, poses many
new challenges to the field of referring video object segmentation (RVOS). In this technical …

Cited by 1 · Related articles · All 4 versions

Foundation Models for Video Understanding: A Survey

N Madan, A Møgelmose, R Modi, YS Rawat… - arXiv preprint arXiv …, 2024 - arxiv.org

Video Foundation Models (ViFMs) aim to learn a general-purpose representation for various
video understanding tasks. Leveraging large-scale datasets and powerful models, ViFMs …

Cited by 3 · Related articles · All 5 versions

MonoDETRNext: Next-generation Accurate and Efficient Monocular 3D Object Detection Method

P Liao, F Yang, D Wu, L Bo - arXiv preprint arXiv:2405.15176, 2024 - arxiv.org

Monocular vision-based 3D object detection is crucial in various sectors, yet existing
methods face significant challenges in terms of accuracy and computational efficiency …
