VideoCon: Robust video-language alignment via contrast captions

H Bansal, Y Bitton, I Szpektor… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite being (pre)trained on a massive amount of data, state-of-the-art video-language
alignment models are not robust to semantically plausible contrastive changes in the video …

More context, less distraction: Visual classification by inferring and conditioning on contextual attributes

B An, S Zhu, MA Panaitescu-Liess… - arXiv preprint arXiv …, 2023 - arxiv.org
CLIP, as a foundational vision-language model, is widely used in zero-shot image
classification due to its ability to understand various visual concepts and natural language …

Generating Enhanced Negatives for Training Language-Based Object Detectors

S Zhao, L Zhao, Y Suh, DN Metaxas… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recent progress in language-based open-vocabulary object detection can be largely
attributed to finding better ways of leveraging large-scale data with free-form text …

Enhancing Cross-Lingual Image Description: A Multimodal Approach for Semantic Relevance and Stylistic Alignment

E Al-Buraihy, D Wang - Computers, Materials & Continua, 2024 - cdn.techscience.cn
Cross-lingual image description, the task of generating image captions in a target language
from images and descriptions in a source language, is addressed in this study through a …

The crashworthiness prediction and deformation constraint optimization of shrink energy-absorbing structures based on deep learning architecture

J He, P Xu, J Xing, S Yao, B Wang, X Zheng - Advances in Engineering …, 2024 - Elsevier
The deformation behavior of shrink energy-absorbing structures is influenced by numerous
factors, and improper matching of parameters in the design process can easily lead to …

Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

K Park, K Saito, D Kim - arXiv preprint arXiv:2407.15296, 2024 - arxiv.org
Vision-language (VL) models often exhibit a limited understanding of complex expressions
of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse …

Evolving Interpretable Visual Classifiers with Large Language Models

M Chiquier, U Mall, C Vondrick - arXiv preprint arXiv:2404.09941, 2024 - arxiv.org
Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to
their open-vocabulary flexibility and high performance. However, vision-language models …

Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

M Dikter, T Blau, C Baskin - arXiv preprint arXiv:2406.08840, 2024 - arxiv.org
Concept bottleneck models (CBMs) have emerged as critical tools in domains where
interpretability is paramount. These models rely on predefined textual descriptions, referred …

Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort

J Kim, Z Wang, Q Qiu - arXiv preprint arXiv:2407.08947, 2024 - arxiv.org
Enhancing model interpretability can address spurious correlations by revealing how
models draw their predictions. Concept Bottleneck Models (CBMs) can provide a principled …

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions

R Esfandiarpoor, C Menghini, SH Bach - arXiv preprint arXiv:2403.16442, 2024 - arxiv.org
Recent works often assume that Vision-Language Model (VLM) representations are based
on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this …