ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax

Radiology narrative reports often describe characteristics of a patient's disease, including its location, size, and shape. Motivated by the recent success of multimodal learning, we hypothesized that this descriptive text could guide medical image analysis algorithms. We proposed a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs. ConTEXTual Net extracts language features from physician-generated free-form radiology reports using a pre-trained language model. We then introduced cross-attention between the language features and the intermediate embeddings of an encoder-decoder convolutional neural network to enable language guidance for image analysis. ConTEXTual Net was trained on the CANDID-PTX dataset consisting of 3196 positive cases of pneumothorax with segmentation annotations from 6 different physicians as well as clinical radiology reports. Using cross-validation, ConTEXTual Net achieved a Dice score of 0.716±0.016, which was similar to the degree of inter-reader variability (0.712±0.044) computed on a subset of the data. It outperformed vision-only models (Swin UNETR: 0.670±0.015, ResNet50 U-Net: 0.677±0.015, GLoRIA: 0.686±0.014, and nnUNet: 0.694±0.016) and a competing vision-language model (LAVT: 0.706±0.009). Ablation studies confirmed that it was the text information that led to the performance gains. Additionally, we show that certain augmentation methods degraded ConTEXTual Net's segmentation performance by breaking the image-text concordance. We also evaluated the effects of using different language models and activation functions in the cross-attention module, highlighting the efficacy of our chosen architectural design.
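The cross-attention step described in the abstract — report-derived language features attending into the intermediate embeddings of the convolutional encoder-decoder — can be sketched in plain NumPy. This is an illustrative sketch, not the authors' implementation: the projection matrices, dimensions, residual placement, and function names here are assumptions for demonstration only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, text_feats, d_k=32, seed=0):
    """Toy text-to-image cross-attention.

    image_feats: (H*W, C) flattened CNN feature map (queries).
    text_feats:  (T, C) token features from a language model (keys/values).
    Returns text-conditioned image features of shape (H*W, C).
    """
    rng = np.random.default_rng(seed)
    C = image_feats.shape[1]
    # Randomly initialized projections stand in for learned weights.
    Wq = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wk = rng.standard_normal((C, d_k)) / np.sqrt(C)
    Wv = rng.standard_normal((C, C)) / np.sqrt(C)
    Q = image_feats @ Wq                      # (H*W, d_k)
    K = text_feats @ Wk                       # (T, d_k)
    V = text_feats @ Wv                       # (T, C)
    attn = softmax(Q @ K.T / np.sqrt(d_k))    # (H*W, T): each pixel attends over tokens
    # Residual connection: language guidance is added onto the visual features.
    return image_feats + attn @ V

# Example: an 8x8 feature map with 64 channels attends over 12 report tokens.
img = np.random.default_rng(1).standard_normal((64, 64))
txt = np.random.default_rng(2).standard_normal((12, 64))
out = cross_attention(img, txt)
print(out.shape)  # (64, 64)
```

The key property shown here is that the output keeps the spatial shape of the visual features, so the text-conditioned map can be fed back into the decoder at the same resolution.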

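The Dice score used throughout the record's abstract measures overlap between a predicted segmentation mask and a reference mask, 2|A∩B| / (|A| + |B|). A minimal sketch (the smoothing term `eps` is a common convention, assumed here, not taken from the paper):

```python
import numpy as np

def dice_score(pred, target, eps=1e-7):
    """Dice coefficient between two binary masks: 2*|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

# Two small masks: 4 pixels vs. 6 pixels, overlapping in 4 pixels.
a = np.zeros((4, 4), dtype=int); a[1:3, 1:3] = 1
b = np.zeros((4, 4), dtype=int); b[1:3, 1:4] = 1
print(round(dice_score(a, b), 3))  # 0.8
```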
Bibliographic details

Published in: Journal of Digital Imaging, 2024-08, Vol. 37 (4), p. 1652-1663
Authors: Huemann, Zachary; Tie, Xin; Hu, Junjie; Bradshaw, Tyler J
Format: Article
Language: English
Online access: Full text
DOI: 10.1007/s10278-024-01051-8
Publisher: Springer Nature B.V (Switzerland)
PMID: 38485899
ISSN: 2948-2933, 0897-1889, 2948-2925
EISSN: 2948-2933, 1618-727X
Sources: MEDLINE; PubMed Central; SpringerLink Journals
Subjects:
Ablation
Algorithms
Annotations
Artificial neural networks
Encoders-Decoders
Free form
Humans
Image analysis
Image degradation
Image processing
Image segmentation
Language
Machine learning
Medical imaging
Natural Language Processing
Neural networks
Neural Networks, Computer
Performance degradation
Performance evaluation
Physicians
Pneumothorax
Pneumothorax - diagnostic imaging
Radiography, Thoracic
Radiology
Segmentation