SugarCrepe: Fixing hackable benchmarks for vision-language compositionality

CY Hsieh, J Zhang, Z Ma… - Advances in neural …, 2024 - proceedings.neurips.cc
In the last year alone, a surge of new benchmarks to measure compositional
understanding of vision-language models has permeated the machine learning ecosystem …

FineMatch: Aspect-based fine-grained image and text mismatch detection and correction

H Hua, J Shi, K Kafle, S Jenni, D Zhang… - … on Computer Vision, 2025 - Springer
Recent progress in large-scale pre-training has led to the development of advanced vision-
language models (VLMs) with remarkable proficiency in comprehending and generating …

International Scientific Report on the Safety of Advanced AI (Interim Report)

Y Bengio, S Mindermann, D Privitera… - arXiv preprint arXiv …, 2024 - arxiv.org
This is the interim publication of the first International Scientific Report on the Safety of
Advanced AI. The report synthesises the scientific understanding of general-purpose AI--AI …

An enhanced prompt-based LLM reasoning scheme via knowledge graph-integrated collaboration

Y Li, R Zhang, J Liu - International Conference on Artificial Neural …, 2024 - Springer
While Large Language Models (LLMs) demonstrate exceptional performance in a
multitude of Natural Language Processing (NLP) tasks, they encounter challenges in …

Tailoring self-rationalizers with multi-reward distillation

S Ramnath, B Joshi, S Hallinan, X Lu, LH Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LMs) are capable of generating free-text rationales to aid question
answering. However, prior work 1) suggests that useful self-rationalization is emergent only …

IntentionQA: A benchmark for evaluating purchase intention comprehension abilities of language models in e-commerce

W Ding, W Wang, SHD Kwok, M Liu, T Fang… - arXiv preprint arXiv …, 2024 - arxiv.org
Enhancing Language Models' (LMs) ability to understand purchase intentions in
E-commerce scenarios is crucial for their effective assistance in various downstream tasks …

Knowledge editing on black-box large language models

X Song, Z Wang, K He, G Dong, Y Mou, J Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
Knowledge editing (KE) aims to efficiently and precisely modify the behavior of large
language models (LLMs) to update specific knowledge without negatively influencing other …

CANDLE: iterative conceptualization and instantiation distillation from large language models for commonsense reasoning

W Wang, T Fang, C Li, H Shi, W Ding, B Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
The sequential process of conceptualization and instantiation is essential to generalizable
commonsense reasoning as it allows the application of existing knowledge to unfamiliar …

MARS: Benchmarking the metaphysical reasoning abilities of language models with a multi-task evaluation dataset

W Wang, Y Song - arXiv preprint arXiv:2406.02106, 2024 - arxiv.org
To enable Large Language Models (LLMs) to function as conscious agents with
generalizable reasoning capabilities, it is crucial that they possess the reasoning ability to …

KoCommonGEN v2: A benchmark for navigating Korean commonsense reasoning challenges in large language models

J Seo, J Lee, C Park, ST Hong, S Lee… - Findings of the …, 2024 - aclanthology.org
The evolution of large language models (LLMs) has culminated in a multitask model
paradigm where prompts drive the generation of user-specific outputs. However, this …