Active label cleaning for improved dataset quality under resource constraints

Detailed description

Bibliographic details
Published in:	Nature communications 2022-03, Vol.13 (1), p.1161-11, Article 1161
Main authors:	Bernhardt, Mélanie; Castro, Daniel C.; Tanno, Ryutaro; Schwaighofer, Anton; Tezcan, Kerem C.; Monteiro, Miguel; Bannur, Shruthi; Lungren, Matthew P.; Nori, Aditya; Glocker, Ben; Alvarez-Valle, Javier; Oktay, Ozan
Format:	Article
Language:	eng
Subjects:	631/114/1305; 692/700/139; 692/700/1421/1770; Benchmarking; Data Curation; Delivery of Health Care; Diagnostic Imaging; Humanities and Social Sciences; Machine Learning; multidisciplinary; Science; Science (multidisciplinary)
Online access:	Full text

Description
Summary:	Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term “active label cleaning”. We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts’ valuable time for improving dataset quality. High-quality labels are important for model performance, evaluation and selection in medical imaging. As manual labelling is time-consuming and costly, the authors explore and benchmark various resource-effective methods for improving dataset quality.
ISSN:	2041-1723
DOI:	10.1038/s41467-022-28818-3
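
Illustrative sketch (not part of the catalog record): the summary describes ranking samples for re-annotation by estimated label correctness and labelling difficulty. The record does not give the scoring rule, so the Python snippet below is only one plausible reading under stated assumptions: label correctness is approximated by the cross-entropy of the current label under a classifier's predicted probabilities, and labelling difficulty by the entropy of those probabilities. The function names `relabel_priority` and `rank_for_relabelling`, and the choice to subtract the entropy so that ambiguous samples are deprioritised, are hypothetical and should not be taken as the authors' implementation.

```python
import math
from typing import List, Sequence


def relabel_priority(noisy_label: int, probs: Sequence[float], eps: float = 1e-12) -> float:
    """Hypothetical priority score: higher means 'send to an expert sooner'."""
    # Assumed proxy for label (in)correctness: cross-entropy of the current
    # label under the model; large when the model disagrees with the label.
    cross_entropy = -math.log(max(probs[noisy_label], eps))
    # Assumed proxy for labelling difficulty: predictive entropy; large when
    # the model finds the sample ambiguous.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Assumption: ambiguous samples are deprioritised, since a single extra
    # annotation is less likely to settle them.
    return cross_entropy - entropy


def rank_for_relabelling(labels: Sequence[int],
                         probs_per_sample: Sequence[Sequence[float]]) -> List[int]:
    """Return sample indices sorted from highest to lowest priority."""
    scores = [relabel_priority(l, p) for l, p in zip(labels, probs_per_sample)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy data: three samples, two classes (made-up probabilities).
labels = [0, 0, 1]
probs = [
    [0.9, 0.1],  # model agrees with the label
    [0.1, 0.9],  # model confidently disagrees with the label
    [0.5, 0.5],  # model is unsure
]
order = rank_for_relabelling(labels, probs)
print(order)
```

In this toy example the confidently contradicted label is ranked first, the ambiguous sample second, and the sample whose label the model agrees with last. A random-selection baseline, as mentioned in the summary, would instead shuffle the indices.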