MedFILIP: Medical Fine-Grained Language-Image Pre-Training

Medical vision-language pretraining (VLP) that leverages naturally paired medical image-report data is crucial for medical image analysis. However, existing methods struggle to accurately characterize the associations between images and diseases, leading to inaccurate or incomplete diagnostic results. In this work, we propose MedFILIP, a fine-grained VLP model that introduces medical image-specific knowledge through contrastive learning. Specifically: 1) An information extractor based on a large language model decouples comprehensive disease details from reports; it excels at extracting disease details through flexible prompt engineering, effectively reducing text complexity while retaining rich information at a tiny cost. 2) A knowledge injector constructs relationships between categories and visual attributes, which helps the model make judgments based on image features and fosters knowledge extrapolation to unfamiliar disease categories. 3) A semantic similarity matrix based on fine-grained annotations provides smoother, information-richer labels, allowing fine-grained image-text alignment. 4) We validate MedFILIP on numerous datasets, e.g., RSNA-Pneumonia, NIH ChestX-ray14, VinBigData, and COVID-19. For single-label, multi-label, and fine-grained classification, our model achieves state-of-the-art performance, improving classification accuracy by up to 6.69%.
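The semantic-similarity-matrix idea in point 3 can be illustrated with a minimal sketch: instead of the hard identity targets of standard image-text contrastive learning, the cross-entropy targets come from a soft similarity matrix over the batch. This is an assumption-laden illustration (function names, shapes, and the row-normalization step are hypothetical), not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_contrastive_loss(img_emb, txt_emb, sim_matrix, temperature=0.07):
    """Image-to-text contrastive loss with soft targets.

    Cross-entropy between softmax-ed cosine-similarity logits and targets
    derived from a semantic similarity matrix, rather than the usual
    one-hot identity matrix.
    """
    # L2-normalize embeddings so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                 # (N, N) pairwise logits
    targets = sim_matrix / sim_matrix.sum(axis=1, keepdims=True)  # soft labels
    log_probs = np.log(softmax(logits, axis=1))
    return -(targets * log_probs).sum(axis=1).mean()

# Toy usage: with identity targets, aligned pairs should score a lower loss
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
loss = soft_contrastive_loss(emb, emb, np.eye(4))
```

In practice the semantic similarity matrix would encode graded relatedness between fine-grained disease annotations (e.g., two pneumonia subtypes score higher than unrelated findings), which is what yields the "smoother, information-richer labels" the abstract describes.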

Detailed description

Saved in:
Bibliographic details
Published in: IEEE journal of biomedical and health informatics 2025-01, p.1-11
Main authors: Liang, Xinjie, Li, Xiangyu, Li, Fanding, Jiang, Jie, Dong, Qing, Wang, Wei, Wang, Kuanquan, Dong, Suyu, Luo, Gongning, Li, Shuo
Format: Article
Language: eng
Subjects:
Online access: Order full text
DOI: 10.1109/JBHI.2025.3528196
ISSN: 2168-2194
eISSN: 2168-2208
Source: IEEE Electronic Library (IEL)
Subjects:
Bioinformatics
Complexity theory
Contrastive learning
CXR imaging
Data mining
Diseases
Feature extraction
fine-grained
interpretability
Large language models
Medical diagnostic imaging
Training
vision-language pretraining
Visualization
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-12T16%3A05%3A31IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MedFILIP:%20Medical%20Fine-Grained%20Language-Image%20Pre-Training&rft.jtitle=IEEE%20journal%20of%20biomedical%20and%20health%20informatics&rft.au=Liang,%20Xinjie&rft.date=2025-01-09&rft.spage=1&rft.epage=11&rft.pages=1-11&rft.issn=2168-2194&rft.eissn=2168-2208&rft.coden=IJBHA9&rft_id=info:doi/10.1109/JBHI.2025.3528196&rft_dat=%3Ccrossref_RIE%3E10_1109_JBHI_2025_3528196%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10836674&rfr_iscdi=true