Classification Assessment Tool: A program to measure the uncertainty of classification models in terms of class-level metrics
Published in: Applied Soft Computing, 2024-04, Vol. 155, p. 111468, Article 111468
Format: Article
Language: English
Abstract: Accuracy assessment is an important step in classification and has gained relevance with the rise of machine and deep learning techniques. We provide a method for quick model evaluation with several options: calculating class-level accuracy metrics for as many models and classes as needed, and calculating model stability using random subsets of the testing data. The outputs are single calculations, summaries of the repetitions, and/or all accuracy results per repetition. Using the application, we demonstrated the possibilities of the function and analyzed the accuracies of three experiments. We found that some popular metrics, the binary Overall Accuracy, Sensitivity, Precision, and Specificity, as well as the ROC curve, can give misleading results when true negative cases dominate. The F1-score, Intersection over Union, and the Matthews correlation coefficient were reliable in all experiments. Medians and interquartile ranges (IQR) of repeated sampling from the testing dataset showed that the IQR was small when a model was either almost perfect or completely unacceptable; thus, the IQR reflected model stability and reproducibility. We found no general, statistically justified relationship between the median and the IQR; furthermore, correlations among the accuracy metrics also varied by experiment. Accordingly, a multi-metric evaluation is suggested instead of relying on a single metric.
Highlights:
• Accuracy assessments are biased by the testing dataset.
• Repetitions help to quantify the uncertainty of accuracy measures.
• The developed tool determines class-level accuracies with their uncertainties.
• We highlighted accuracy measures that are biased by many true negative cases.
• F1, IoU, and the Matthews correlation performed well in all experiments.
ISSN: 1568-4946; 1872-9681
DOI: 10.1016/j.asoc.2024.111468