Emerging Pixel Grounding in Large Multimodal Models Without Grounding Supervision

S Cao, LY Gui, YX Wang - arXiv preprint arXiv:2410.08209, 2024 - arxiv.org
Current large multimodal models (LMMs) face challenges in grounding, which requires the
model to relate language components to visual entities. Contrary to the common practice …

Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Y Man, S Zheng, Z Bao, M Hebert, LY Gui… - arXiv preprint arXiv …, 2024 - arxiv.org
Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies playing a crucial role in this success. However, the optimal scene encoding …

MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion

L Wu, L Lin, J Zhang, Y Ma, J Liu - arXiv preprint arXiv:2409.10473, 2024 - arxiv.org
Self-supervised learning has proved effective for skeleton-based human action
understanding. However, previous works either rely on contrastive learning that suffers false …

Denoising Reuse: Exploiting Inter-frame Motion Consistency for Efficient Video Latent Generation

C Wang, S Yan, Y Chen, Y Wang, M Dong… - arXiv preprint arXiv …, 2024 - arxiv.org
Video generation using diffusion-based models is constrained by high computational costs
due to the frame-wise iterative diffusion process. This work presents a Diffusion Reuse …

Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models

Q Wang, A Eldesokey, M Mendiratta, F Zhan… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on
pre-trained diffusion models. A growing research direction attempts to employ diffusion …

Lexicon3D: Probing Visual Encoding Models for Complex 3D Scene Understanding

Y Man, S Zheng, Z Bao, M Hebert, LY Gui, YX Wang - yunzeman.github.io
Complex 3D scene understanding has gained increasing attention, with scene encoding
strategies playing a crucial role in this success. However, the optimal scene encoding …