StoryER: Automatic story evaluation via ranking, rating and reasoning

Z Li, X Xu, T Shen, C Xu, JC Gu, Y Lai… - Proceedings of the …, 2024 - aclanthology.org

In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …

Cited by 10 · Related articles · All 2 versions

[PDF] arxiv.org

Is ChatGPT a good NLG evaluator? A preliminary study

J Wang, Y Liang, F Meng, Z Sun, H Shi, Z Li… - arXiv preprint arXiv …, 2023 - arxiv.org

Recently, the emergence of ChatGPT has attracted wide attention from the computational
linguistics community. Many prior studies have shown that ChatGPT achieves remarkable …

Cited by 339 · Related articles · All 6 versions

[PDF] neurips.cc

Benchmarking foundation models with language-model-as-an-examiner

Y Bai, J Ying, Y Cao, X Lv, Y He… - Advances in …, 2024 - proceedings.neurips.cc

Numerous benchmarks have been established to assess the performance of foundation
models on open-ended question answering, which serves as a comprehensive test of a …

Cited by 111 · Related articles · All 6 versions

Leveraging large language models for NLG evaluation: A survey

Z Li, X Xu, T Shen, C Xu, JC Gu, C Tao - arXiv e-prints, 2024 - ui.adsabs.harvard.edu

In the rapidly evolving domain of Natural Language Generation (NLG) evaluation,
introducing Large Language Models (LLMs) has opened new avenues for assessing …

Cited by 44 · Related articles

[PDF] arxiv.org

What makes a good story and how can we measure it? A comprehensive survey of story evaluation

D Yang, Q Jin - arXiv preprint arXiv:2408.14622, 2024 - arxiv.org

With the development of artificial intelligence, particularly the success of Large Language
Models (LLMs), the quantity and quality of automatically generated stories have significantly …

Cited by 2 · Related articles · All 2 versions

[PDF] arxiv.org

Learning personalized story evaluation

D Wang, K Yang, H Zhu, X Yang, A Cohen, L Li… - arXiv preprint arXiv …, 2023 - arxiv.org

While large language models (LLMs) have shown impressive results for more objective
tasks such as QA and retrieval, it remains nontrivial to evaluate their performance on open …

Cited by 11 · Related articles · All 3 versions

[PDF] arxiv.org

MedQA-CS: Benchmarking large language models clinical skills using an AI-SCE framework

Z Yao, Z Zhang, C Tang, X Bian, Y Zhao, Z Yang… - arXiv preprint arXiv …, 2024 - arxiv.org

Artificial intelligence (AI) and large language models (LLMs) in healthcare require advanced
clinical skills (CS), yet current benchmarks fail to evaluate these comprehensively. We …

Cited by 2 · Related articles · All 2 versions

[PDF] arxiv.org

Which Prompts Make The Difference? Data Prioritization For Efficient Human LLM Evaluation

M Boubdir, E Kim, B Ermis, M Fadaee… - arXiv preprint arXiv …, 2023 - arxiv.org

Human evaluation is increasingly critical for assessing large language models, capturing
linguistic nuances, and reflecting user preferences more accurately than traditional …

Cited by 6 · Related articles · All 2 versions

[PDF] arxiv.org

Are NLP Models Good at Tracing Thoughts: An Overview of Narrative Understanding

L Zhu, R Zhao, L Gui, Y He - arXiv preprint arXiv:2310.18783, 2023 - arxiv.org

Narrative understanding involves capturing the author's cognitive processes, providing
insights into their knowledge, intentions, beliefs, and desires. Although large language …

Cited by 10 · Related articles · All 5 versions

[PDF] arxiv.org

CoRRPUS: Code-based structured prompting for neurosymbolic story understanding

YR Dong, LJ Martin, C Callison-Burch - arXiv preprint arXiv:2212.10754, 2022 - arxiv.org

Story generation and understanding--as with all NLG/NLU tasks--has seen a surge in
neurosymbolic work. Researchers have recognized that, while large language models …

Cited by 8 · Related articles · All 7 versions
