Who validates the validators? Aligning LLM-assisted evaluation of LLM outputs with human preferences

A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org

Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

Cited by 8 · Related articles · All 3 versions

From LLMs to LLM-based agents for software engineering: A survey of current, challenges and future

H Jin, L Huang, H Cai, J Yan, B Li, H Chen - arXiv preprint arXiv …, 2024 - arxiv.org

With the rise of large language models (LLMs), researchers are increasingly exploring their
applications in various vertical domains, such as software engineering. LLMs have …

Cited by 22 · Related articles · All 3 versions

Comparative analysis of finetuning strategies and automated evaluation metrics for large language models in customer service chatbots

B Ilse, F Blackwood - 2024 - researchsquare.com

Customer service chatbots have become integral to the efficient operation of many
businesses, offering scalable solutions to handle vast volumes of customer interactions …

Cited by 41 · Related articles · All 2 versions

Managing the Twin Faces of AI: A Commentary on “Is AI Changing the World for Better or Worse?”

V Shankar - Journal of Macromarketing, 2024 - journals.sagepub.com

This commentary explores the transformative potential and inherent risks associated with
artificial intelligence (AI), particularly generative AI. It focuses on the dual nature of AI, where …

Cited by 3 · Related articles · All 2 versions

The problem of alignment

T Hristova, L Magee, K Soldatic - AI & SOCIETY, 2024 - Springer

Large language models (LLMs) produce sequences learned as statistical patterns from
large corpora. Their emergent status as representatives of the advances in artificial …

Cited by 4 · Related articles · All 2 versions

ChainBuddy: An AI Agent System for Generating LLM Pipelines

J Zhang, I Arawjo - arXiv preprint arXiv:2409.13588, 2024 - arxiv.org

As large language models (LLMs) advance, their potential applications have grown
significantly. However, it remains difficult to evaluate LLM behavior on user-specific tasks …

Cited by 2 · Related articles · All 2 versions

Regurgitative training: The value of real data in training large language models

J Zhang, D Qiao, M Yang, Q Wei - arXiv preprint arXiv:2407.12835, 2024 - arxiv.org

What happens if we train a new Large Language Model (LLM) using data that are at least
partially generated by other LLMs? The explosive success of LLMs means that a substantial …

Cited by 3 · Related articles · All 3 versions

What you say = what you want? Teaching humans to articulate requirements for LLMs

Q Ma, W Peng, H Shen, K Koedinger, T Wu - arXiv preprint arXiv …, 2024 - arxiv.org

Prompting ChatGPT to achieve complex goals (e.g., creating a customer support chatbot)
often demands meticulous prompt engineering, including aspects like fluent writing and …

Cited by 2 · Related articles · All 5 versions

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

H Li, Q Dong, J Chen, H Su, Y Zhou, Q Ai, Z Ye… - arXiv preprint arXiv …, 2024 - arxiv.org

The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …

Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

A Szymanski, N Ziems, HA Eicher-Miller, TJJ Li… - arXiv preprint arXiv …, 2024 - arxiv.org

The potential of using Large Language Models (LLMs) themselves to evaluate LLM outputs
offers a promising method for assessing model performance across various contexts …

Cited by 1 · Related articles · All 2 versions
