We propose BERTScore, an automatic evaluation metric for text generation. Analogously to common metrics, BERTScore computes a similarity score for each token in the candidate …
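The snippet above describes token-level similarity scoring; as a minimal sketch of that idea, the following assumes pre-computed contextual token embeddings (the random arrays below are placeholders for encoder outputs) and computes greedy-matching precision, recall, and F1 over cosine similarities. Importance weighting and baseline rescaling from the BERTScore paper are omitted.

```python
import numpy as np

def greedy_match_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """Greedy-matching F1 over contextual token embeddings (the core of BERTScore-style scoring).

    cand_emb: (k, d) array, one embedding per candidate token
    ref_emb:  (m, d) array, one embedding per reference token
    """
    # L2-normalize so dot products are cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)

    sim = cand @ ref.T                   # (k, m) pairwise cosine similarities
    precision = sim.max(axis=1).mean()   # each candidate token matched to its most similar reference token
    recall = sim.max(axis=0).mean()      # each reference token matched to its most similar candidate token
    return 2 * precision * recall / (precision + recall)

# Toy stand-in embeddings; in practice these would come from a BERT-style encoder.
rng = np.random.default_rng(0)
cand_emb = rng.normal(size=(5, 768))
ref_emb = rng.normal(size=(7, 768))
print(greedy_match_f1(cand_emb, ref_emb))
```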
Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the …
Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the …
This paper presents the results of the WMT20 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation …
Automatically evaluating the quality of language generation is critical. Although recent learned metrics show high correlation with human judgement, these metrics cannot explain …
This paper presents the results of the WMT19 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT19 News Translation …
We report results from the SR'18 Shared Task, a new multilingual surface realisation task organised as part of the ACL'18 Workshop on Multilingual Surface Realisation. As in its …
Automatically constructed datasets for generating text from semi-structured data (tables), such as WikiBio, often contain reference texts that diverge from the information in the …
This paper presents the results of the WMT16 Metrics Shared Task. We asked participants of this task to score the outputs of the MT systems involved in the WMT16 Shared Translation …