Recommendations on compiling test datasets for evaluating artificial intelligence solutions in pathology

Bibliographic details
Published in: Modern Pathology 2022-12, Vol. 35 (12), p. 1759-1769
Main authors: Homeyer, André; Geißler, Christian; Schwen, Lars Ole; Zakrzewski, Falk; Evans, Theodore; Strohmenger, Klaus; Westphal, Max; Bülow, Roman David; Kargl, Michaela; Karjauv, Aray; Munné-Bertran, Isidre; Retzlaff, Carl Orge; Romero-López, Adrià; Sołtysiński, Tomasz; Plass, Markus; Carvalho, Rita; Steinbach, Peter; Lan, Yu-Chia; Bouteldja, Nassim; Haber, David; Rojas-Carulla, Mateo; Vafaei Sadr, Alireza; Kraft, Matthias; Krüger, Daniel; Fick, Rutger; Lang, Tobias; Boor, Peter; Müller, Heimo; Hufnagl, Peter; Zerbe, Norman
Format: Article
Language: English
Online access: Full text
Description
Summary: Artificial intelligence (AI) solutions that automatically extract information from digital histology images have shown great promise for improving pathological diagnosis. Prior to routine use, it is important to evaluate their predictive performance and obtain regulatory approval. This assessment requires appropriate test datasets. However, compiling such datasets is challenging and specific recommendations are missing. A committee of various stakeholders, including commercial AI developers, pathologists, and researchers, discussed key aspects and conducted extensive literature reviews on test datasets in pathology. Here, we summarize the results and derive general recommendations on compiling test datasets. We address several questions: Which and how many images are needed? How to deal with low-prevalence subsets? How can potential bias be detected? How should datasets be reported? What are the regulatory requirements in different countries? The recommendations are intended to help AI developers demonstrate the utility of their products and to help pathologists and regulatory agencies verify reported performance measures. Further research is needed to formulate criteria for sufficiently representative test datasets so that AI solutions can operate with less user intervention and better support diagnostic workflows in the future.
ISSN: 0893-3952, 1530-0285
DOI: 10.1038/s41379-022-01147-y