VideoCon: Robust video-language alignment via contrast captions

H Bansal, Y Bitton, I Szpektor… - Proceedings of the …, 2024 - openaccess.thecvf.com
Despite being (pre)trained on a massive amount of data, state-of-the-art video-language
alignment models are not robust to semantically plausible contrastive changes in the video …

More context, less distraction: Visual classification by inferring and conditioning on contextual attributes

B An, S Zhu, MA Panaitescu-Liess… - arXiv preprint arXiv …, 2023 - arxiv.org
CLIP, as a foundational vision-language model, is widely used in zero-shot image
classification due to its ability to understand various visual concepts and natural language …

Generating Enhanced Negatives for Training Language-Based Object Detectors

S Zhao, L Zhao, Y Suh, DN Metaxas… - Proceedings of the …, 2024 - openaccess.thecvf.com
The recent progress in language-based open-vocabulary object detection can be largely
attributed to finding better ways of leveraging large-scale data with free-form text …

Enhancing Cross-Lingual Image Description: A Multimodal Approach for Semantic Relevance and Stylistic Alignment

E Al-Buraihy, D Wang - Computers, Materials & Continua, 2024 - cdn.techscience.cn
Cross-lingual image description, the task of generating image captions in a target language
from images and descriptions in a source language, is addressed in this study through a …

The crashworthiness prediction and deformation constraint optimization of shrink energy-absorbing structures based on deep learning architecture

J He, P Xu, J Xing, S Yao, B Wang, X Zheng - Advances in Engineering …, 2024 - Elsevier
The deformation behavior of shrink energy-absorbing structures is influenced by numerous
factors, and improper matching of parameters in the design process can easily lead to …

Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection

K Park, K Saito, D Kim - arXiv preprint arXiv:2407.15296, 2024 - arxiv.org
Vision-language (VL) models often exhibit a limited understanding of complex expressions
of visual objects (e.g., attributes, shapes, and their relations), given complex and diverse …

Evolving Interpretable Visual Classifiers with Large Language Models

M Chiquier, U Mall, C Vondrick - arXiv preprint arXiv:2404.09941, 2024 - arxiv.org
Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to
their open-vocabulary flexibility and high performance. However, vision-language models …

Conceptual Learning via Embedding Approximations for Reinforcing Interpretability and Transparency

M Dikter, T Blau, C Baskin - arXiv preprint arXiv:2406.08840, 2024 - arxiv.org
Concept bottleneck models (CBMs) have emerged as critical tools in domains where
interpretability is paramount. These models rely on predefined textual descriptions, referred …

Constructing Concept-based Models to Mitigate Spurious Correlations with Minimal Human Effort

J Kim, Z Wang, Q Qiu - arXiv preprint arXiv:2407.08947, 2024 - arxiv.org
Enhancing model interpretability can address spurious correlations by revealing how
models draw their predictions. Concept Bottleneck Models (CBMs) can provide a principled …

If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions

R Esfandiarpoor, C Menghini, SH Bach - arXiv preprint arXiv:2403.16442, 2024 - arxiv.org
Recent works often assume that Vision-Language Model (VLM) representations are based
on visual attributes like shape. However, it is unclear to what extent VLMs prioritize this …