Leveraging large language models for NLG evaluation: Advances and challenges

Z Li, X Xu, T Shen, C Xu, JC Gu, Y Lai… - Proceedings of the …, 2024 - aclanthology.org
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
the introduction of Large Language Models (LLMs) has opened new avenues for assessing …

Is ChatGPT a good NLG evaluator? A preliminary study

J Wang, Y Liang, F Meng, Z Sun, H Shi, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, the emergence of ChatGPT has attracted wide attention from the computational
linguistics community. Many prior studies have shown that ChatGPT achieves remarkable …

Benchmarking foundation models with language-model-as-an-examiner

Y Bai, J Ying, Y Cao, X Lv, Y He… - Advances in …, 2024 - proceedings.neurips.cc
Numerous benchmarks have been established to assess the performance of foundation
models on open-ended question answering, which serves as a comprehensive test of a …

Leveraging large language models for NLG evaluation: A survey

Z Li, X Xu, T Shen, C Xu, JC Gu, C Tao - arXiv e-prints, 2024 - ui.adsabs.harvard.edu
In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
the introduction of Large Language Models (LLMs) has opened new avenues for assessing …

What makes a good story and how can we measure it? a comprehensive survey of story evaluation

D Yang, Q Jin - arXiv preprint arXiv:2408.14622, 2024 - arxiv.org
With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …

Learning personalized story evaluation

D Wang, K Yang, H Zhu, X Yang, A Cohen, L Li… - arXiv preprint arXiv …, 2023 - arxiv.org
While large language models (LLMs) have shown impressive results for more objective
tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open …

MedQA-CS: Benchmarking large language models' clinical skills using an AI-SCE framework

Z Yao, Z Zhang, C Tang, X Bian, Y Zhao, Z Yang… - arXiv preprint arXiv …, 2024 - arxiv.org
Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced
clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We …

Which Prompts Make the Difference? Data Prioritization for Efficient Human LLM Evaluation

M Boubdir, E Kim, B Ermis, M Fadaee… - arXiv preprint arXiv …, 2023 - arxiv.org
Human evaluation is increasingly critical for assessing large language models, capturing
linguistic nuances, and reflecting user preferences more accurately than traditional …

Are NLP Models Good at Tracing Thoughts: An Overview of Narrative Understanding

L Zhu, R Zhao, L Gui, Y He - arXiv preprint arXiv:2310.18783, 2023 - arxiv.org
Narrative understanding involves capturing the author's cognitive processes, providing
insights into their knowledge, intentions, beliefs, and desires. Although large language …

CoRRPUS: Code-based structured prompting for neurosymbolic story understanding

YR Dong, LJ Martin, C Callison-Burch - arXiv preprint arXiv:2212.10754, 2022 - arxiv.org
Story generation and understanding, as with all NLG/NLU tasks, have seen a surge in
neurosymbolic work. Researchers have recognized that, while large language models …