Authors
Fei Tang, Wanling Gao, LuZhou Peng, Jianfeng Zhan
Publication date
2023/12/3
Book
International Symposium on Benchmarking, Measuring and Optimization
Pages
137-152
Publisher
Springer Nature Singapore
Description
Large language models (LLMs) like ChatGPT have demonstrated remarkable intelligence. How to evaluate the question-solving abilities of LLMs and their degrees of intelligence is a prominent yet challenging issue. First, question-solving ability is interlaced with different ability branches, such as understanding, and massive knowledge categories, such as mathematics. Second, the inputs of questions are multimodal and may involve both text and images. In addition, questions may have varying levels of difficulty, while a unified standard for judging which question is more difficult is lacking. Third, the response formats of LLMs are diverse, which poses great challenges for result extraction and evaluation. Several benchmarks have been proposed to evaluate LLMs, yet they still exhibit significant shortcomings.
In this paper, to tackle the above challenges, we propose AGIBench, a multi-granularity, multimodal, human-referenced, and auto-scoring …