IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training

In the field of medical Vision-Language Pretraining (VLP), significant efforts have been devoted to deriving text and image features from both clinical reports and associated medical images. However, most existing methods may have overlooked the opportunity in leveraging the inherent hierarchical structure of clinical reports, which are generally split into 'findings' for descriptive content and 'impressions' for conclusive observation. Instead of utilizing this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from the chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge in formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports for vision-language alignment.
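The report-level split drives the alignment: descriptive 'findings' text is matched against the lower- and mid-level visual features, while conclusive 'impressions' text is matched against the highest-level features. The PyTorch sketch below is a minimal re-creation of that idea, not the authors' code; the module name, feature dimensions, and the exact level-to-section assignment are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalAligner(nn.Module):
    # Projects multi-level visual features and two report-section embeddings
    # into one shared space; 'findings' pair with the low/mid levels,
    # 'impressions' with the top level. Dimensions are illustrative.
    def __init__(self, vis_dims=(256, 512, 1024), txt_dim=768, emb_dim=128):
        super().__init__()
        self.vis_proj = nn.ModuleList([nn.Linear(d, emb_dim) for d in vis_dims])
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, vis_feats, findings_emb, impression_emb):
        # vis_feats: list of pooled per-level features, each of shape (B, d_level)
        zs = [F.normalize(p(v), dim=-1) for p, v in zip(self.vis_proj, vis_feats)]
        z_find = F.normalize(self.txt_proj(findings_emb), dim=-1)
        z_impr = F.normalize(self.txt_proj(impression_emb), dim=-1)
        # Descriptive text vs. all but the deepest level; conclusive text
        # vs. the deepest level only.
        sim_findings = torch.stack([z @ z_find.T for z in zs[:-1]]).mean(dim=0)
        sim_impression = zs[-1] @ z_impr.T
        return sim_findings, sim_impression

# Toy forward pass with random inputs.
B = 4
model = HierarchicalAligner()
vis = [torch.randn(B, d) for d in (256, 512, 1024)]
sim_f, sim_i = model(vis, torch.randn(B, 768), torch.randn(B, 768))
print(sim_f.shape, sim_i.shape)  # torch.Size([4, 4]) torch.Size([4, 4])
```

Averaging the lower-level similarities is one plausible way to fuse them into a single score matrix; the paper's actual fusion scheme may differ.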

Bibliographic Details
Published in: IEEE transactions on medical imaging 2024-08, Vol.PP, p.1-1
Main authors: Liu, Che, Cheng, Sibo, Shi, Miaojing, Shah, Anand, Bai, Wenjia, Arcucci, Rossella
Format: Article
Language: eng
Subjects: Chest X-ray Image Analysis; Feature extraction; Medical diagnostic imaging; Self-supervised Learning; Semantics; Task analysis; Training; Vision-Language Pre-training; Visualization; X-ray imaging
Online access: Order full text
container_end_page 1
container_issue
container_start_page 1
container_title IEEE transactions on medical imaging
container_volume PP
creator Liu, Che
Cheng, Sibo
Shi, Miaojing
Shah, Anand
Bai, Wenjia
Arcucci, Rossella
description In the field of medical Vision-Language Pretraining (VLP), significant efforts have been devoted to deriving text and image features from both clinical reports and associated medical images. However, most existing methods may have overlooked the opportunity in leveraging the inherent hierarchical structure of clinical reports, which are generally split into 'findings' for descriptive content and 'impressions' for conclusive observation. Instead of utilizing this rich, structured format, current medical VLP approaches often simplify the report into either a unified entity or fragmented tokens. In this work, we propose a novel clinical prior guided VLP framework named IMITATE to learn the structure information from medical reports with hierarchical vision-language alignment. The framework derives multi-level visual features from the chest X-ray (CXR) images and separately aligns these features with the descriptive and the conclusive text encoded in the hierarchical medical report. Furthermore, a new clinical-informed contrastive loss is introduced for cross-modal learning, which accounts for clinical prior knowledge in formulating sample correlations in contrastive learning. The proposed model, IMITATE, outperforms baseline VLP methods across six different datasets, spanning five medical imaging downstream tasks. Comprehensive experimental results highlight the advantages of integrating the hierarchical structure of medical reports for vision-language alignment.
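The 'clinical-informed contrastive loss' described above replaces the usual one-hot InfoNCE targets with soft targets derived from clinical prior knowledge about which reports are correlated. Below is a minimal sketch under that reading; the function name and the construction of the `prior_sim` matrix are assumptions for illustration, not the paper's API.

```python
import torch
import torch.nn.functional as F

def clinical_informed_contrastive(img_emb, txt_emb, prior_sim, tau=0.07):
    # Soft-target InfoNCE: targets come from a clinical prior over report
    # similarity rather than the identity matrix, so clinically correlated
    # pairs are not pushed apart as false negatives.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / tau                     # (B, B) image-to-text scores
    targets = F.softmax(prior_sim / tau, dim=-1)   # soft labels from the prior
    loss_i2t = (-targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
    loss_t2i = (-targets.T * F.log_softmax(logits.T, dim=-1)).sum(dim=-1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy prior: samples 0 and 1 describe overlapping findings, so the prior
# treats them as partially positive rather than as hard negatives.
B, D = 4, 128
prior = torch.eye(B)
prior[0, 1] = prior[1, 0] = 0.8
loss = clinical_informed_contrastive(torch.randn(B, D), torch.randn(B, D), prior)
print(loss.item())
```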
doi_str_mv 10.1109/TMI.2024.3449690
format Article
fulltext fulltext_linktorsrc
identifier ISSN: 0278-0062
ispartof IEEE transactions on medical imaging, 2024-08, Vol.PP, p.1-1
issn 0278-0062
1558-254X
1558-254X
language eng
recordid cdi_pubmed_primary_39186435
source IEEE Electronic Library (IEL)
subjects Chest X-ray Image Analysis
Feature extraction
Medical diagnostic imaging
Self-supervised Learning
Semantics
Task analysis
Training
Vision-Language Pre-training
Visualization
X-ray imaging
title IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-21T03%3A09%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=IMITATE:%20Clinical%20Prior%20Guided%20Hierarchical%20Vision-Language%20Pre-training&rft.jtitle=IEEE%20transactions%20on%20medical%20imaging&rft.au=Liu,%20Che&rft.date=2024-08-26&rft.volume=PP&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=0278-0062&rft.eissn=1558-254X&rft.coden=ITMID4&rft_id=info:doi/10.1109/TMI.2024.3449690&rft_dat=%3Cproquest_RIE%3E3097492194%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3097492194&rft_id=info:pmid/39186435&rft_ieee_id=10646593&rfr_iscdi=true