Assessing Inter-Annotator Agreement for Medical Image Segmentation


Bibliographic Details
Published in: IEEE Access, 2023-01, Vol. 11, p. 1-1
Authors: Yang, Feng; Zamzmi, Ghada; Angara, Sandeep; Rajaraman, Sivaramakrishnan; Aquilina, Andre; Xue, Zhiyun; Jaeger, Stefan; Papagiannakis, Emmanouil; Antani, Sameer K
Format: Article
Language: English
Online access: Full text
Description
Abstract: Artificial Intelligence (AI)-based medical computer vision algorithm training and evaluation depend on annotations and labeling. However, variability between expert annotators introduces noise into training data that can adversely impact the performance of AI algorithms. This study aims to assess, illustrate, and interpret the inter-annotator agreement among multiple expert annotators when segmenting the same lesion(s)/abnormalities on medical images. We propose three metrics for the qualitative and quantitative assessment of inter-annotator agreement: 1) a common agreement heatmap and a ranking agreement heatmap; 2) the extended Cohen's kappa and Fleiss' kappa coefficients for a quantitative evaluation and interpretation of inter-annotator reliability; and 3) the Simultaneous Truth and Performance Level Estimation (STAPLE) algorithm, applied as a parallel step, to generate ground truth for training AI models and to compute Intersection over Union (IoU), sensitivity, and specificity for assessing inter-annotator reliability and variability. Experiments are performed on two datasets, namely cervical colposcopy images from 30 patients and chest X-ray images from 336 tuberculosis (TB) patients, to demonstrate the consistency of inter-annotator reliability assessment and the importance of combining different metrics to avoid biased assessment.
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3249759
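
The abstract names pixel-level agreement measures: Fleiss' kappa across annotators, and IoU, sensitivity, and specificity of each annotator's mask against a STAPLE-derived consensus. The sketch below is an illustrative approximation of those computations, not the authors' code: it substitutes a simple majority vote for STAPLE, and the function names, array shapes, and synthetic masks are assumptions made for the example.

# Minimal sketch of pixel-level agreement metrics for binary segmentation
# masks from multiple annotators. Majority vote stands in for the STAPLE
# consensus described in the abstract; all names and data are illustrative.
import numpy as np


def fleiss_kappa(masks):
    """Fleiss' kappa over pixels: each pixel is an item rated foreground (1)
    or background (0) by every annotator."""
    stack = np.stack([m.astype(bool).ravel() for m in masks])  # (raters, pixels)
    n_raters, n_pixels = stack.shape
    pos = stack.sum(axis=0)                          # foreground votes per pixel
    counts = np.stack([n_raters - pos, pos], axis=1)  # (pixels, 2) category counts
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                               # mean per-pixel agreement
    p_j = counts.sum(axis=0) / (n_pixels * n_raters)  # marginal category proportions
    p_e = np.sum(p_j ** 2)                           # chance agreement
    return (p_bar - p_e) / (1.0 - p_e)


def overlap_metrics(mask, reference):
    """IoU, sensitivity, and specificity of one annotator's mask against a
    reference mask (e.g. a STAPLE or majority-vote consensus)."""
    m, r = mask.astype(bool), reference.astype(bool)
    tp = np.logical_and(m, r).sum()
    fp = np.logical_and(m, ~r).sum()
    fn = np.logical_and(~m, r).sum()
    tn = np.logical_and(~m, ~r).sum()
    iou = tp / (tp + fp + fn)
    return iou, tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    truth = np.zeros((64, 64), dtype=bool)
    truth[20:44, 20:44] = True
    # Three synthetic annotators: the same lesion with independent pixel noise.
    annotators = [np.logical_xor(truth, rng.random(truth.shape) < 0.05)
                  for _ in range(3)]
    consensus = np.stack(annotators).sum(axis=0) >= 2  # majority vote, not STAPLE
    print("Fleiss' kappa:", round(fleiss_kappa(annotators), 3))
    for i, m in enumerate(annotators, 1):
        iou, sens, spec = overlap_metrics(m, consensus)
        print(f"annotator {i}: IoU={iou:.3f} sens={sens:.3f} spec={spec:.3f}")

In practice the consensus mask would come from a STAPLE implementation (e.g. in SimpleITK) rather than a majority vote, and the kappa values would be interpreted against a standard agreement scale as discussed in the paper.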