Agreement is overrated: A plea for correlation to assess human evaluation reliability

J Amidei, P Piwek, A Willis - 2019 - oro.open.ac.uk
Inter-Annotator Agreement (IAA) is used as a means of assessing the quality of NLG evaluation data, in particular, its reliability. According to existing scales of IAA interpretation – see, for example, Lommel et al. (2014), Liu et al. (2016), Sedoc et al. (2018) and Amidei et al. (2018a) – most data collected for NLG evaluation fail the reliability test. We confirmed this trend by analysing papers published over the last 10 years in NLG-specific conferences (in total 135 papers that included some sort of human evaluation study). Following Sampson and Babarczy (2008), Lommel et al. (2014), Joshi et al. (2016) and Amidei et al. (2018b), this trend can be explained in terms of irreducible human language variability. Using three case studies, we show the limits of considering IAA as the only criterion for checking evaluation reliability. Given human language variability, we propose that for human evaluation of NLG, correlation coefficients and agreement coefficients should be used together to obtain a better assessment of the reliability of the evaluation data.
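
To make the difference between agreement and correlation concrete, here is a minimal sketch in Python. It is not taken from the paper: the abstract does not say which coefficients the authors use, so Cohen's kappa (agreement) and Pearson's r (correlation) are assumptions chosen for illustration, and the ratings are invented toy data. The second annotator rates every item exactly one point higher than the first, so the two never agree on a label yet rank the items identically.

```python
from collections import Counter
from math import sqrt


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumes the chance agreement p_e is below 1 (otherwise undefined).
    """
    n = len(a)
    # Observed agreement: share of items given the identical label.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's own label distribution.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def pearson_r(a, b):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Toy ratings (hypothetical): annotator B is consistently one point harsher.
annotator_a = [1, 2, 3, 4, 1, 2, 3, 4]
annotator_b = [2, 3, 4, 5, 2, 3, 4, 5]

kappa = cohen_kappa(annotator_a, annotator_b)
r = pearson_r(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}, r = {r:.2f}")
```

On this toy data kappa is negative (about -0.23), which any standard interpretation scale would read as unreliable, while r is 1: the annotators use the scale differently but order the items in exactly the same way. Reporting both coefficients exposes that pattern, whereas an agreement coefficient alone would not.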