Difficulty and severity-oriented metrics for test prioritization in deep learning systems

H Al-Qadasi, Y Falcone… - 2023 IEEE International Conference On Artificial Intelligence …, 2023 - ieeexplore.ieee.org
Recently, there has been a growing trend in AI testing toward developing new test prioritization algorithms for deep learning systems. These algorithms aim to reduce the cost and time needed to annotate test datasets by prioritizing instances with a higher chance of exposing faults. Various metrics have been used to evaluate the effectiveness of these algorithms, e.g., APFD, RAUC, and ATRC. However, there is a lack of research confirming their validity. Our results indicate that the existing metrics have severe limitations. For example, some metrics ignore the labeling budget and prioritize the fault detection rate instead of the fault detection ratio. Moreover, others overlook the prioritization difficulty in the evaluation. As a solution, we develop a new metric (WFDR) that resolves the deficiencies of the previous metrics. We also draw attention to a new research area, known as severity prioritization, which emphasizes the importance of prioritizing misclassified instances according to their severity level, particularly in critical situations. Our experiments reveal that high-severity instances make up more than 20% of all misclassified instances, so they should be prioritized for labeling. Consequently, we propose a second metric (SFDR) that evaluates how effectively an algorithm prioritizes high-severity instances. Our evaluations show that our proposed metrics are more effective than existing ones. In addition, re-evaluating some recent algorithms with our two metrics reveals that these algorithms perform poorly.
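
For context, below is a minimal Python sketch of how two of the metrics named in the abstract, APFD and RAUC, are conventionally computed over a prioritized test ordering. The boolean `is_fault` encoding and the function names are assumptions for illustration, and the budget-limited `fault_detection_ratio` helper is purely hypothetical; none of this reproduces the paper's WFDR or SFDR definitions.

```python
# Sketch of two standard prioritization metrics mentioned in the abstract.
# Assumption: a test ordering is encoded as a list of booleans, where
# is_fault[i] is True iff the i-th test in the prioritized order exposes a
# fault (e.g., a misclassified instance).
from typing import Sequence


def apfd(is_fault: Sequence[bool]) -> float:
    """Average Percentage of Fault Detection:
    APFD = 1 - (sum of 1-based fault positions) / (n * m) + 1 / (2n)."""
    n = len(is_fault)
    positions = [i + 1 for i, f in enumerate(is_fault) if f]
    m = len(positions)
    if n == 0 or m == 0:
        raise ValueError("need at least one test and one fault")
    return 1.0 - sum(positions) / (n * m) + 1.0 / (2 * n)


def rauc(is_fault: Sequence[bool]) -> float:
    """Ratio of the area under the cumulative fault-detection curve to the
    area under the ideal curve (all faulty tests ranked first)."""
    n, m = len(is_fault), sum(is_fault)
    if m == 0:
        raise ValueError("need at least one fault")
    found, area = 0, 0
    for f in is_fault:
        found += int(f)
        area += found
    ideal = sum(min(k, m) for k in range(1, n + 1))
    return area / ideal


def fault_detection_ratio(is_fault: Sequence[bool], budget: int) -> float:
    """Hypothetical budget-aware ratio: the fraction of all faults exposed
    within the first `budget` labeled tests. Included only to make the
    rate-vs-ratio distinction concrete; this is NOT the paper's WFDR."""
    return sum(is_fault[:budget]) / sum(is_fault)


# Example: 10 tests, 4 of which expose faults.
order = [True, False, True, True, False, False, True, False, False, False]
print(f"APFD = {apfd(order):.3f}")                       # 0.675
print(f"RAUC = {rauc(order):.3f}")                       # 0.853
print(f"FDR@5 = {fault_detection_ratio(order, 5):.3f}")  # 0.750
```

Note how APFD rewards early fault positions irrespective of any labeling budget, while the budget-limited ratio changes with the cutoff; this is one way to read the abstract's criticism that some existing metrics ignore the labeling budget.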