LLaVA-Critic: Learning to evaluate multimodal models

T Xiong, X Wang, D Guo, Q Ye, H Fan, Q Gu… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as
a generalist evaluator to assess performance across a wide range of multimodal tasks …

Prometheus 2: An open source language model specialized in evaluating other language models

S Kim, J Suk, S Longpre, BY Lin, J Shin… - arXiv preprint arXiv …, 2024 - arxiv.org
Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from
various LMs. However, concerns including transparency, controllability, and affordability …

LLaVA-o1: Let Vision Language Models Reason Step-by-Step

G Xu, P Jin, L Hao, Y Song, L Sun, L Yuan - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models have demonstrated substantial advancements in reasoning
capabilities, particularly through inference-time scaling, as illustrated by models such as …

VHELM: A holistic evaluation of vision-language models

T Lee, H Tu, CH Wong, W Zheng, Y Zhou, Y Mai… - arXiv preprint arXiv …, 2024 - arxiv.org
Current benchmarks for assessing vision-language models (VLMs) often focus on their
perception or problem-solving capabilities and neglect other critical aspects such as …

Recommendation with generative models

Y Deldjoo, Z He, J McAuley, A Korikov… - arXiv preprint arXiv …, 2024 - arxiv.org
Generative models are a class of AI models capable of creating new instances of data by
learning and sampling from their statistical distributions. In recent years, these models have …

SPIQA: A dataset for multimodal question answering on scientific papers

S Pramanick, R Chellappa, S Venugopalan - arXiv preprint arXiv …, 2024 - arxiv.org
Seeking answers to questions within long scientific research articles is a crucial area of
study that aids readers in quickly addressing their inquiries. However, existing question …

Aligning to thousands of preferences via system message generalization

S Lee, SH Park, S Kim, M Seo - arXiv preprint arXiv:2405.17977, 2024 - arxiv.org
Although humans inherently have diverse values, current large language model (LLM)
alignment methods often assume that aligning LLMs with the general public's preferences is …

A Survey on LLM-as-a-Judge

J Gu, X Jiang, Z Shi, H Tan, X Zhai, C Xu, W Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Accurate and consistent evaluation is crucial for decision-making across numerous fields,
yet it remains a challenging task due to inherent subjectivity, variability, and scale. Large …

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

H Li, Q Dong, J Chen, H Su, Y Zhou, Q Ai, Z Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …

Benchmarking XAI Explanations with Human-Aligned Evaluations

R Kazmierczak, S Azzolin, E Berthier… - arXiv preprint arXiv …, 2024 - arxiv.org
In this paper, we introduce PASTA (Perceptual Assessment System for explanaTion of
Artificial intelligence), a novel framework for a human-centric evaluation of XAI techniques in …