AGIEval: A human-centric benchmark for evaluating foundation models (2023)

CausalKGPT: industrial structure causal knowledge-enhanced large language model for cause analysis of quality problems in aerospace product manufacturing

B Zhou, X Li, T Liu, K Xu, W Liu, J Bao - Advanced Engineering Informatics, 2024 - Elsevier

The whole cycle for manufacturing aerospace thin-walled shells is a lengthy and
sophisticated process. A large amount of quality-related data exists within and between …

Cited by 31

Evaluating Large Language Models in Process Mining: Capabilities, Benchmarks, Evaluation Strategies, and Future Challenges

A Berti, H Kourani, H Häfke, CY Li… - arXiv preprint arXiv …, 2024 - arxiv.org

Using Large Language Models (LLMs) for Process Mining (PM) tasks is becoming
increasingly essential, and initial approaches yield promising results. However, little …

Cited by 9

TeleChat Technical Report

Z He, Z Wang, X Liu, S Liu, Y Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org

In this technical report, we present TeleChat, a collection of large language models (LLMs)
with parameters of 3 billion, 7 billion and 12 billion. It includes pretrained language models …

Cited by 10

GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

S Zheng, Y Zhang, Y Zhu, C Xi, P Gao, X Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org

With the rapid advancement of large language models (LLMs), there is a pressing need for a
comprehensive evaluation suite to assess their capabilities and limitations. Existing LLM …

Cited by 5

Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation

S Singh, A Romanou, C Fourrier, DI Adelani… - arXiv preprint arXiv …, 2024 - arxiv.org

Cultural biases in multilingual datasets pose significant challenges for their effectiveness as
global benchmarks. These biases stem not only from language but also from the cultural …

Cited by 2

Evaluating Large Language Models in Process Mining: Capabilities, Benchmarks, and Evaluation Strategies

A Berti, H Kourani, H Häfke, CY Li… - … Conference on Business …, 2024 - Springer

Using Large Language Models (LLMs) for Process Mining (PM) tasks is becoming
increasingly essential, and initial approaches yield promising results. However, little …

Cited by 2

Project MPG: towards a generalized performance benchmark for LLM capabilities

L Spangher, T Li, WF Arnold, N Masiewicki… - arXiv preprint arXiv …, 2024 - arxiv.org

There exists an extremely wide array of LLM benchmarking tasks, whereas oftentimes a
single number is the most actionable for decision-making, especially by non-experts. No …

The CLEF 2024 Monster Track: One Lab to Rule Them All

N Ferro, J Gonzalo, J Karlgren, H Müller - European Conference on …, 2024 - Springer

Generative Artificial Intelligence (AI) and Large Language Models (LLMs) are
revolutionizing technology and society thanks to their versatility and applicability to a wide …

Cited by 1

Uncovering Latent Chain of Thought Vectors in Language Models

J Zhang, S Viteri - arXiv preprint arXiv:2409.14026, 2024 - arxiv.org

As language models grow more influential and trusted in our society, our ability to reliably
steer them toward favorable behaviors becomes increasingly paramount. For this, we …
