A systematic survey and critical review on evaluating large language models: Challenges, limitations, and recommendations

MTR Laskar, S Alqahtani, MS Bari… - Proceedings of the …, 2024 - aclanthology.org
Abstract Large Language Models (LLMs) have recently gained significant attention due to
their remarkable capabilities in performing diverse tasks across various domains. However …

From LLMs to LLM-based agents for software engineering: A survey of current, challenges and future

H Jin, L Huang, H Cai, J Yan, B Li, H Chen - arXiv preprint arXiv …, 2024 - arxiv.org
With the rise of large language models (LLMs), researchers are increasingly exploring their
applications in various vertical domains, such as software engineering. LLMs have …

Comparative analysis of finetuning strategies and automated evaluation metrics for large language models in customer service chatbots

B Ilse, F Blackwood - 2024 - researchsquare.com
Customer service chatbots have become integral to the efficient operation of many
businesses, offering scalable solutions to handle vast volumes of customer interactions …

Managing the Twin Faces of AI: A Commentary on “Is AI Changing the World for Better or Worse?”

V Shankar - Journal of Macromarketing, 2024 - journals.sagepub.com
This commentary explores the transformative potential and inherent risks associated with
artificial intelligence (AI), particularly generative AI. It focuses on the dual nature of AI, where …

The problem of alignment

T Hristova, L Magee, K Soldatic - AI & SOCIETY, 2024 - Springer
Large language models (LLMs) produce sequences learned as statistical patterns from
large corpora. Their emergent status as representatives of the advances in artificial …

ChainBuddy: An AI Agent System for Generating LLM Pipelines

J Zhang, I Arawjo - arXiv preprint arXiv:2409.13588, 2024 - arxiv.org
As large language models (LLMs) advance, their potential applications have grown
significantly. However, it remains difficult to evaluate LLM behavior on user-specific tasks …

Regurgitative training: The value of real data in training large language models

J Zhang, D Qiao, M Yang, Q Wei - arXiv preprint arXiv:2407.12835, 2024 - arxiv.org
What happens if we train a new Large Language Model (LLM) using data that are at least
partially generated by other LLMs? The explosive success of LLMs means that a substantial …

What you say = what you want? Teaching humans to articulate requirements for LLMs

Q Ma, W Peng, H Shen, K Koedinger, T Wu - arXiv preprint arXiv …, 2024 - arxiv.org
Prompting ChatGPT to achieve complex goals (e.g., creating a customer support chatbot)
often demands meticulous prompt engineering, including aspects like fluent writing and …

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

H Li, Q Dong, J Chen, H Su, Y Zhou, Q Ai, Z Ye… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid advancement of Large Language Models (LLMs) has driven their expanding
application across various fields. One of the most promising applications is their role as …

Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

A Szymanski, N Ziems, HA Eicher-Miller, TJJ Li… - arXiv preprint arXiv …, 2024 - arxiv.org
The potential of using Large Language Models (LLMs) themselves to evaluate LLM outputs
offers a promising method for assessing model performance across various contexts …