Grading of diabetic retinopathy using a pre-segmenting deep learning classification model: Validation of an automated algorithm
To validate the performance of autonomous diabetic retinopathy (DR) grading by comparing a human grader and a self-developed deep-learning (DL) algorithm with gold-standard evaluation. We included 500, 6-field retinal images graded by an expert ophthalmologist (gold standard) according to the Intern...
Gespeichert in:
Veröffentlicht in: | Acta ophthalmologica (Oxford, England) England), 2024-10 |
---|---|
Hauptverfasser: | , , , , |
Format: | Artikel |
Sprache: | eng |
Online-Zugang: | Volltext |
Tags: |
Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
|
Zusammenfassung: | To validate the performance of autonomous diabetic retinopathy (DR) grading by comparing a human grader and a self-developed deep-learning (DL) algorithm with gold-standard evaluation.
We included 500, 6-field retinal images graded by an expert ophthalmologist (gold standard) according to the International Clinical Diabetic Retinopathy Disease Severity Scale as represented with DR levels 0-4 (97, 100, 100, 103, 100, respectively). Weighted kappa was calculated to measure the DR classification agreement for (1) a certified human grader without, and (2) with assistance from a DL algorithm and (3) the DL operating autonomously. Using any DR (level 0 vs. 1-4) as a cutoff, we calculated sensitivity, specificity, as well as positive and negative predictive values (PPV and NPV). Finally, we assessed lesion discrepancies between Model 3 and the gold standard.
As compared to the gold standard, weighted kappa for Models 1-3 was 0.88, 0.89 and 0.72, sensitivities were 95%, 94% and 78% and specificities were 82%, 84% and 81%. Extrapolating to a real-world DR prevalence of 23.8%, the PPV were 63%, 64% and 57% and the NPV were 98%, 98% and 92%. Discrepancies between the gold standard and Model 3 were mainly incorrect detection of artefacts (n = 49), missed microaneurysms (n = 26) and inconsistencies between the segmentation and classification (n = 51).
While the autonomous DL algorithm for DR classification only performed on par with a human grader for some measures in a high-risk population, extrapolations to a real-world population demonstrated an excellent 92% NPV, which could make it clinically feasible to use autonomously to identify non-DR patients. |
---|---|
ISSN: | 1755-375X 1755-3768 1755-3768 |
DOI: | 10.1111/aos.16781 |