Learning from multiple annotators for medical image segmentation

Highlights

• A novel deep CNN architecture is proposed for jointly learning the expert consensus label and the annotators' labels. The proposed architecture (Fig. 1) consists of two coupled CNNs, where one estimates the expert consensus label probabilities and the other models the characteristics of individual annotators (e.g., a tendency to over-segment, or mix-ups between different classes) by estimating pixel-wise confusion matrices (CMs) on a per-image basis (a schematic sketch of this coupling follows the list). Unlike STAPLE [25] and its variants, our method models, and disentangles with deep neural networks, the complex mappings from the input images to the annotator behaviours and to the expert consensus label.

• The parameters of our CNNs are "global variables" that are optimised across different image samples; this enables the model to robustly disentangle the annotators' mistakes from the expert consensus label based on correlations between similar image samples, even when the number of available annotations per image is small (e.g., a single annotation per image). In contrast, this is not possible with STAPLE [25] and its variants [5,8], where the annotators' parameters are estimated on every target image separately.

• This paper extends the preliminary version of our method, presented at the Thirty-Fourth Conference on Neural Information Processing Systems (NeurIPS) [30], by extensively evaluating the model on a newly created real-world multiple sclerosis lesion dataset (QSMSC at UCL: Queen Square Multiple Sclerosis Center at UCL, UK). This dataset was generated with manual segmentations from 4 different annotators (3 radiologists with different skill levels and 1 expert to generate the expert consensus label). Additionally, we present a comprehensive discussion of the model's potential applications (e.g., estimating annotator quality and annotation quality), future directions we plan to explore, and the model's potential limitations.
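The coupling described in the first highlight can be made concrete with a short sketch. The following is a minimal, hypothetical PyTorch rendering (not the authors' released code) of how a per-pixel confusion matrix produced by the annotator network can be applied to the consensus network's class probabilities to yield the predicted noisy-label distribution; all tensor names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def noisy_label_distribution(consensus_logits, cm_logits):
    """Couple the two CNN heads.

    consensus_logits: [B, C, H, W]    raw scores from the consensus CNN
    cm_logits:        [B, C, C, H, W] raw scores from the annotator CNN,
                      one C x C confusion matrix per pixel
    Returns the predicted distribution over the annotator's noisy labels,
    shape [B, C, H, W].
    """
    # consensus probabilities p(true class = j | x), per pixel
    p_true = F.softmax(consensus_logits, dim=1)
    # normalise each CM column so a[:, i, j, h, w] approximates
    # p(annotator says i | true class j) at pixel (h, w)
    a = F.softmax(cm_logits, dim=1)
    # p(noisy = i) = sum_j a_ij * p(true = j), applied pixel-wise
    return torch.einsum('bijhw,bjhw->bihw', a, p_true)

# toy usage: batch of 2, three classes, 8x8 images
consensus = torch.randn(2, 3, 8, 8)
cms = torch.randn(2, 3, 3, 8, 8)
p_noisy = noisy_label_distribution(consensus, cms)
# the result is a valid per-pixel distribution
assert torch.allclose(p_noisy.sum(dim=1), torch.ones(2, 8, 8), atol=1e-5)
```

Because the CM columns are normalised, the output always sums to one over classes; the einsum is simply a per-pixel matrix-vector product between the annotator's CM and the consensus probabilities.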

Detailed description

Supervised machine learning methods have been widely developed for segmentation tasks in recent years. However, the quality of the labels has a high impact on the predictive performance of these algorithms. This issue is particularly acute in the medical image domain, where both the cost of annotation and the inter-observer variability are high. In a typical label acquisition process, different human experts contribute their estimates of the "actual" segmentation labels, influenced by their personal biases and competency levels. The performance of automatic segmentation algorithms is limited when these noisy labels are used as the expert consensus label. In this work, we use two coupled CNNs to jointly learn, from purely noisy observations alone, the reliability of individual annotators and the expert consensus label distributions. The separation of the two is achieved by maximally describing the annotators' "unreliable behavior" (we call it "maximally unreliable") while achieving high fidelity with the noisy training data. We first create a toy segmentation dataset using MNIST and investigate the properties of the proposed algorithm. We then use three public medical imaging segmentation datasets to demonstrate our method's efficacy, including both simulated (where necessary) and real-world annotations: 1) ISBI2015 (multiple-sclerosis lesions); 2) BraTS (brain tumors); 3) LIDC-IDRI (lung abnormalities). Finally, we create a real-world multiple sclerosis lesion dataset (QSMSC at UCL: Queen Square Multiple Sclerosis Center at UCL, UK) with manual segmentations from 4 different annotators (3 radiologists with different skill levels and 1 expert to generate the expert consensus label). On all datasets, our method consistently outperforms competing methods and relevant baselines, especially when the number of annotations is small and the amount of disagreement is large. The studies also reveal that the system is capable of capturing the complicated spatial characteristics of annotators' mistakes.
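The "maximally unreliable" principle in the abstract can likewise be sketched as a training objective. The snippet below is one plausible instantiation, assuming (in the spirit of the authors' NeurIPS formulation [30]) that unreliability is encouraged by penalising the trace of the estimated confusion matrices while a standard negative log-likelihood fits each annotator's noisy labels; `lam`, the helper names, and the exact regulariser are assumptions, not the paper's verbatim loss.

```python
import torch
import torch.nn.functional as F

def multi_annotator_loss(consensus_logits, cm_logits_per_annotator,
                         noisy_labels, lam=0.01):
    """Hypothetical joint objective over R annotators.

    consensus_logits:        [B, C, H, W]
    cm_logits_per_annotator: list of R tensors, each [B, C, C, H, W]
    noisy_labels:            list of R integer label maps, each [B, H, W]
    """
    p_true = F.softmax(consensus_logits, dim=1)
    total = consensus_logits.new_zeros(())
    for cm_logits, y in zip(cm_logits_per_annotator, noisy_labels):
        a = F.softmax(cm_logits, dim=1)  # column-stochastic per-pixel CMs
        p_noisy = torch.einsum('bijhw,bjhw->bihw', a, p_true)
        # high fidelity to this annotator's observed noisy labels
        nll = F.nll_loss(torch.log(p_noisy + 1e-8), y)
        # small mean trace keeps the CMs "maximally unreliable", so
        # annotator reliability cannot be absorbed into the consensus head
        trace = a.diagonal(dim1=1, dim2=2).sum(dim=-1).mean()
        total = total + nll + lam * trace
    return total
```

The intuition: the fidelity term alone cannot tell annotator noise apart from the consensus, so the trace penalty pushes probability mass off the CM diagonals wherever the data allow, attributing systematic errors to the annotators rather than to the consensus estimate.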

Bibliographic details
Published in: Pattern recognition, 2023-06, Vol. 138, Article 109400
Authors: Zhang, Le; Tanno, Ryutaro; Xu, Moucheng; Huang, Yawen; Bronik, Kevin; Jin, Chen; Jacob, Joseph; Zheng, Yefeng; Shao, Ling; Ciccarelli, Olga; Barkhof, Frederik; Alexander, Daniel C.
Format: Article
Language: English
Subjects: Label fusion; Multi-Annotator; Segmentation
Online access: Full text
DOI: 10.1016/j.patcog.2023.109400
ISSN: 0031-3203
EISSN: 1873-5142