Initial evaluation of state‐of‐the‐art deep learning models on data of Project FOREVER

Bibliographic details
Published in: Acta ophthalmologica (Oxford, England), 2025-01, Vol. 103 (S284), p. n/a
Main authors: Reimann, Marcel; Andreasen, Jens Rovelt; Dahl, Anders Bjorholm; Kolko, Miriam
Format: Article
Language: English
Description
Abstract:
Aims/Purpose: We evaluated state‐of‐the‐art deep learning models, trained on the Artificial Intelligence for Robust Glaucoma Screening (AIROGS) challenge dataset [1], on a cohort subset of Project FOREVER. In addition, we investigated how current evaluation metrics can be applied to a real‐world screening scenario.
Methods: We followed and combined the best‐performing model designs reported on the challenge dataset [1] to assess how well these models generalize to our dataset. The entire pipeline was built from open‐source packages and model weights to ensure reproducibility. Optic disc segmentation and quality assessment were performed with AutoMorph [2]. A vision transformer was then used to classify the cropped images as non‐referable or referable glaucoma. Finally, the model was applied to a labeled subset of Project FOREVER participants.
Results: Using AutoMorph resulted in a much smaller good‐quality training set than the one reported in the AIROGS challenge: it classified one‐fifth of the images as ungradable and failed to segment the optic disc in numerous images. Overall, we achieved similar performance on our test split of the AIROGS dataset. Our results also showed that, although high specificity and sensitivity can be reached, the precision of the algorithms was generally low.
Conclusions: We suggest reporting precision as a standard metric when evaluating screening algorithms. Specificity and sensitivity alone are insufficient because they do not capture the economic aspects of the proposed models. Low‐precision algorithms might place a high burden on healthcare systems through a large number of false‐positive referrals.
References
[1] C. de Vente et al., "AIROGS: Artificial Intelligence for Robust Glaucoma Screening Challenge," IEEE Transactions on Medical Imaging, vol. 43, no. 1, pp. 542‐557, Jan. 2024, doi: 10.1109/TMI.2023.3313786.
[2] Y. Zhou, S. K. Wagner, M. A. Chia, A. Zhao, P. Woodward‐Court, M. Xu, R. Struyven, D. C. Alexander, and P. A. Keane, "AutoMorph: Automated Retinal Vascular Morphology Quantification Via a Deep Learning Pipeline," Translational Vision Science & Technology, vol. 11, no. 7, p. 12, Jul. 2022, doi: 10.1167/tvst.11.7.12. PMID: 35833885; PMCID: PMC9290317.
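The point about precision can be illustrated with a short calculation. The sketch below is not part of the study: it applies Bayes' rule to express precision (positive predictive value) in terms of sensitivity, specificity, and disease prevalence. The function name and all numeric values are assumptions chosen for illustration; none of them are reported in the abstract.

```python
def precision(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Precision (positive predictive value) via Bayes' rule.

    Hypothetical helper for illustration; not taken from the study's pipeline.
    """
    # Fraction of the screened population that is diseased and flagged.
    true_pos = sensitivity * prevalence
    # Fraction of the screened population that is healthy but flagged anyway.
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Illustrative values only: the same classifier (95% sensitivity, 95%
# specificity) evaluated at three assumed prevalences.
for prevalence in (0.50, 0.10, 0.03):
    ppv = precision(0.95, 0.95, prevalence)
    print(f"prevalence={prevalence:.2f}  precision={ppv:.3f}")
```

Under these assumed values, precision falls from 0.950 on a balanced set to 0.679 at 10% prevalence and 0.370 at 3% prevalence, even though sensitivity and specificity are unchanged. In the last case, roughly two out of three referrals would be false positives, which is the kind of burden the conclusions describe.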
ISSN: 1755-375X, 1755-3768
DOI: 10.1111/aos.17116