DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Y Wu, Y Wang, S Tang, W Wu, T He, W Ouyang… - … on Computer Vision, 2025 - Springer
We present DetToolChain, a novel prompting paradigm, to unleash the zero-shot object
detection ability of multimodal large language models (MLLMs), such as GPT-4V and …

Unified Human-centric Model, Framework and Benchmark: A Survey

X Zhao, S Sulaiman, WY Leng - IEEE Access, 2024 - ieeexplore.ieee.org
Human-centric Computer Vision Tasks (HCTs) refer to a series of tasks related to the human
body, such as Human Pose Estimation, Pedestrian Tracking, Re-Identification (ReID) …

OmniFuse: A general modality fusion framework for multi-modality learning on low-quality medical data

Y Wu, J Chen, L Hu, H Xu, H Liang, J Wu - Information Fusion, 2024 - Elsevier
Mirroring the practice of human medical experts, the integration of diverse medical
examination modalities enhances the performance of predictive models in clinical settings …

A survey on person and vehicle re‐identification

Z Wang, L Wang, Z Shi, M Zhang, Q Geng… - IET Computer …, 2024 - Wiley Online Library
Person/vehicle re‐identification aims to use technologies such as cross‐camera retrieval to
associate the same person (same vehicle) in the surveillance videos at different locations …

PoseEmbroider: Towards a 3D, Visual, Semantic-Aware Human Pose Representation

G Delmas, P Weinzaepfel, F Moreno-Noguer… - … on Computer Vision, 2025 - Springer
Aligning multiple modalities in a latent space, such as images and texts, has shown to
produce powerful semantic visual representations, fueling tasks like image captioning, text …

MK-SGN: A Spiking Graph Convolutional Network with Multimodal Fusion and Knowledge Distillation for Skeleton-based Action Recognition

N Zheng, H Xia, Z Liang, Y Chai - arXiv preprint arXiv:2404.10210, 2024 - arxiv.org
In recent years, skeleton-based action recognition, leveraging multimodal Graph
Convolutional Networks (GCN), has achieved remarkable results. However, due to their …

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

J Wu, M Zhong, S Xing, Z Lai, Z Liu, W Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that
unifies visual perception, understanding, and generation within a single framework. Unlike …

Prompt-supervised dynamic attention graph convolutional network for skeleton-based action recognition

S Zhu, L Sun, Z Ma, C Li, D He - Neurocomputing, 2025 - Elsevier
Skeleton-based action recognition is a core task in the field of video understanding.
Skeleton sequences are characterized by high information density, low redundancy, and …

Linguistic-Driven Partial Semantic Relevance Learning for Skeleton-Based Action Recognition

Q Chen, Y Liu, P Huang, J Huang - Sensors, 2024 - mdpi.com
Skeleton-based action recognition, renowned for its computational efficiency and
indifference to lighting variations, has become a focal point in the realm of motion analysis …

CHASE: Learning Convex Hull Adaptive Shift for Skeleton-based Multi-Entity Action Recognition

Y Wen, M Liu, S Wu, B Ding - arXiv preprint arXiv:2410.07153, 2024 - arxiv.org
Skeleton-based multi-entity action recognition is a challenging task aiming to identify
interactive actions or group activities involving multiple diverse entities. Existing models for …