ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax
Radiology narrative reports often describe characteristics of a patient's disease, including its location, size, and shape. Motivated by the recent success of multimodal learning, we hypothesized that this descriptive text could guide medical image analysis algorithms. We proposed a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs. ConTEXTual Net extracts language features from physician-generated free-form radiology reports using a pre-trained language model. We then introduced cross-attention between the language features and the intermediate embeddings of an encoder-decoder convolutional neural network to enable language guidance for image analysis. ConTEXTual Net was trained on the CANDID-PTX dataset consisting of 3196 positive cases of pneumothorax with segmentation annotations from 6 different physicians as well as clinical radiology reports. Using cross-validation, ConTEXTual Net achieved a Dice score of 0.716±0.016, which was similar to the degree of inter-reader variability (0.712±0.044) computed on a subset of the data. It outperformed vision-only models (Swin UNETR: 0.670±0.015, ResNet50 U-Net: 0.677±0.015, GLoRIA: 0.686±0.014, and nnUNet: 0.694±0.016) and a competing vision-language model (LAVT: 0.706±0.009). Ablation studies confirmed that it was the text information that led to the performance gains. Additionally, we show that certain augmentation methods degraded ConTEXTual Net's segmentation performance by breaking the image-text concordance. We also evaluated the effects of using different language models and activation functions in the cross-attention module, highlighting the efficacy of our chosen architectural design.
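The abstract's central mechanism is cross-attention between report-token embeddings and the intermediate feature maps of a convolutional encoder-decoder. The sketch below is a minimal PyTorch illustration of that general idea, not the authors' published implementation: the module name, projection dimensions, residual fusion, and the toy shapes at the bottom are all assumptions for demonstration, and a small Dice helper shows how the overlap metric reported in the abstract is computed.

```python
# Illustrative sketch of report-guided cross-attention (not the paper's code).
# Image feature-map pixels act as queries; report token embeddings supply keys/values.
import torch
import torch.nn as nn


class TextImageCrossAttention(nn.Module):
    """Fuses text features into a CNN feature map via cross-attention."""

    def __init__(self, img_channels: int, text_dim: int, attn_dim: int = 128):
        super().__init__()
        self.q = nn.Conv2d(img_channels, attn_dim, kernel_size=1)  # pixel queries
        self.k = nn.Linear(text_dim, attn_dim)                     # token keys
        self.v = nn.Linear(text_dim, img_channels)                 # token values
        self.scale = attn_dim ** -0.5

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); text_feat: (B, T, D) token embeddings
        b, c, h, w = img_feat.shape
        q = self.q(img_feat).flatten(2).transpose(1, 2)        # (B, H*W, attn_dim)
        k = self.k(text_feat)                                  # (B, T, attn_dim)
        v = self.v(text_feat)                                  # (B, T, C)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, H*W, T)
        ctx = (attn @ v).transpose(1, 2).reshape(b, c, h, w)   # text-conditioned features
        return img_feat + ctx  # residual: language guidance added to image features


def dice_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice = 2|P∩G| / (|P| + |G|) for binary masks, as reported in the abstract."""
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)


# Toy shapes: a 16x16 decoder feature map with 64 channels; 32 report tokens of dim 768.
module = TextImageCrossAttention(img_channels=64, text_dim=768)
fused = module(torch.randn(2, 64, 16, 16), torch.randn(2, 32, 768))
print(fused.shape)  # torch.Size([2, 64, 16, 16])
```

In this reading, each spatial location of the image feature map can attend to report phrases (e.g., laterality or size descriptions), which is one plausible way text could steer a segmentation decoder; the paper's ablations on language models and activation functions suggest the actual module differs in such details.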
Published in: Journal of Digital Imaging, 2024-08, Vol. 37 (4), p. 1652-1663
Authors: Huemann, Zachary; Tie, Xin; Hu, Junjie; Bradshaw, Tyler J
Format: Article
Language: English
Online access: Full text
DOI: 10.1007/s10278-024-01051-8
Publisher: Springer Nature B.V (Switzerland)
PMID: 38485899
ORCID: https://orcid.org/0000-0002-1472-243X
ISSN: 2948-2933; 0897-1889; 2948-2925
EISSN: 2948-2933; 1618-727X
Source: MEDLINE; PubMed Central; SpringerLink Journals - AutoHoldings
Subjects: Ablation; Algorithms; Annotations; Artificial neural networks; Encoders-Decoders; Free form; Humans; Image analysis; Image degradation; Image processing; Image segmentation; Language; Machine learning; Medical imaging; Natural Language Processing; Neural networks; Neural Networks, Computer; Performance degradation; Performance evaluation; Physicians; Pneumothorax; Pneumothorax - diagnostic imaging; Radiography, Thoracic; Radiology; Segmentation