NPHardEval: Dynamic benchmark on reasoning ability of large language models via complexity classes

L Fan, L Li, Z Ma, S Lee, H Yu, L Hemphill - arXiv preprint arXiv …, 2023 - arxiv.org

Large language models (LLMs) are a class of language models that have demonstrated
outstanding performance across a range of natural language processing (NLP) tasks and …

Cited by 89 · Related articles · All 4 versions

[PDF] arxiv.org

Trends and challenges of real-time learning in large language models: A critical review

M Jovanovic, P Voss - arXiv preprint arXiv:2404.18311, 2024 - arxiv.org

Real-time learning concerns the ability of learning systems to acquire knowledge over time,
enabling their adaptation and generalization to novel tasks. It is a critical ability for …

Cited by 9 · Related articles · All 2 versions

[PDF] arxiv.org

DyVal 2: Dynamic evaluation of large language models by meta probing agents

K Zhu, J Wang, Q Zhao, R Xu, X Xie - arXiv preprint arXiv:2402.14865, 2024 - arxiv.org

Evaluation of large language models (LLMs) has raised great concerns in the community
due to the issue of data contamination. Existing work designed evaluation protocols using …

Cited by 8 · Related articles · All 2 versions

[PDF] arxiv.org

On catastrophic inheritance of large foundation models

H Chen, B Raj, X Xie, J Wang - arXiv preprint arXiv:2402.01909, 2024 - arxiv.org

Large foundation models (LFMs) are claiming incredible performances. Yet great concerns
have been raised about their mythic and uninterpreted potentials not only in machine …

Cited by 7 · Related articles · All 2 versions

[PDF] openreview.net

Dynamic Evaluation of Large Language Models by Meta Probing Agents

K Zhu, J Wang, Q Zhao, R Xu, X Xie - Forty-first International …, 2024 - openreview.net

Evaluation of large language models (LLMs) has raised great concerns in the community
due to the issue of data contamination. Existing work designed evaluation protocols using …

Cited by 1 · Related articles

[PDF] arxiv.org

NPHardEval4V: A Dynamic Reasoning Benchmark of Multimodal Large Language Models

L Fan, W Hua, X Li, K Zhu, M Jin, L Li, H Ling… - arXiv preprint arXiv …, 2024 - arxiv.org

Understanding the reasoning capabilities of Multimodal Large Language Models (MLLMs) is
an important area of research. In this study, we introduce a dynamic benchmark …

Cited by 5 · Related articles · All 2 versions

[PDF] arxiv.org

CausalBench: A Comprehensive Benchmark for Causal Learning Capability of Large Language Models

Y Zhou, X Wu, B Huang, J Wu, L Feng… - arXiv preprint arXiv …, 2024 - arxiv.org

Causality reveals fundamental principles behind data distributions in real-world scenarios,
and the capability of large language models (LLMs) to understand causality directly impacts …

Cited by 3 · Related articles · All 2 versions

[PDF] arxiv.org

MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures

J Ni, F Xue, X Yue, Y Deng, M Shah, K Jain… - arXiv preprint arXiv …, 2024 - arxiv.org

Evaluating large language models (LLMs) is challenging. Traditional ground-truth-based
benchmarks fail to capture the comprehensiveness and nuance of real-world queries, while …

Cited by 4 · Related articles · All 2 versions

[PDF] arxiv.org

Large Language Models in Biomedical and Health Informatics: A Bibliometric Review

H Yu, L Fan, L Li, J Zhou, Z Ma, L Xian, W Hua… - arXiv preprint arXiv …, 2024 - arxiv.org

Large Language Models (LLMs) have rapidly become important tools in Biomedical and
Health Informatics (BHI), enabling new ways to analyze data, treat patients, and conduct …

Cited by 2 · Related articles · All 2 versions

[PDF] arxiv.org

Disentangling Logic: The Role of Context in Large Language Model Reasoning Capabilities

W Hua, K Zhu, L Li, L Fan, S Lin, M Jin, H Xue… - arXiv preprint arXiv …, 2024 - arxiv.org

This study intends to systematically disentangle pure logic reasoning and text understanding
by investigating the contrast across abstract and contextualized logical problems from a …

Cited by 2 · Related articles · All 2 versions
