LLaVA-Critic: Learning to evaluate multimodal models

T Xiong, X Wang, D Guo, Q Ye, H Fan, Q Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as
a generalist evaluator to assess performance across a wide range of multimodal tasks …

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

G Xu, P Jin, L Hao, Y Song, L Sun, L Yuan - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models have demonstrated substantial advancements in reasoning
capabilities, particularly through inference-time scaling, as illustrated by models such as …

VHELM: A holistic evaluation of vision-language models

T Lee, H Tu, CH Wong, W Zheng, Y Zhou, Y Mai… - arXiv preprint arXiv …, 2024 - arxiv.org
Current benchmarks for assessing vision-language models (VLMs) often focus on their
perception or problem-solving capabilities and neglect other critical aspects such as …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

SPIQA: A dataset for multimodal question answering on scientific papers

S Pramanick, R Chellappa, S Venugopalan - arXiv preprint arXiv …, 2024 - arxiv.org
Seeking answers to questions within long scientific research articles is a crucial area of
study that aids readers in quickly addressing their inquiries. However, existing question …

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - arXiv preprint arXiv:2405.17977, 2024 - arxiv.org
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …

A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

H Li, Q Dong, J Chen, H Su, Y Zhou, Q Ai, Z Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …

Benchmarking XAI Explanations with Human-Aligned Evaluations

R Kazmierczak, S Azzolin, E Berthier… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce PASTA (Perceptual Assessment System for explanaTion of
Artificial intelligence), a novel framework for a human-centric evaluation of XAI techniques in …