Authors
Fei Tang, Wanling Gao, LuZhou Peng, Jianfeng Zhan
Publication date
2023/12/3
Book
International Symposium on Benchmarking, Measuring and Optimization
Pages
137-152
Publisher
Springer Nature Singapore
Description
Large language models (LLMs) like ChatGPT have demonstrated remarkable intelligence. How to evaluate the question-solving abilities of LLMs and their degrees of intelligence is a prominent yet challenging issue. First, question-solving ability is interlaced with different ability branches, such as understanding, and massive knowledge categories, such as mathematics. Second, the inputs of questions are multimodal and may involve both text and images. In addition, questions may have varying levels of difficulty, while a unified standard for judging which question is more difficult is lacking. Third, the response formats of LLMs are diverse, which poses great challenges for result extraction and evaluation. Several benchmarks have been proposed to evaluate LLMs, yet they still exhibit significant shortcomings.
In this paper, to tackle the above challenges, we propose AGIBench, a multi-granularity, multimodal, human-referenced, and auto-scoring …