ELVIS: Empowering Locality of Vision Language Pre-training with Intra-modal Similarity

Deep learning has shown great potential in assisting radiologists in reading chest X-ray (CXR) images, but its need for expensive annotations for improving performance prevents widespread clinical application. Visual language pre-training (VLP) can alleviate the burden and cost of annotation by leveraging routinely generated reports for radiographs, which exist in large quantities as well as in paired form (image-text pairs). Additionally, extensions to localization-aware VLPs are being proposed to address the need for accurate localization of abnormalities for computer-aided diagnosis (CAD) in CXR. However, we find that the formulation proposed by the locality-aware VLP literature actually leads to a loss of the spatial relationships required for downstream localization tasks. Therefore, we propose Empowering Locality of VLP with Intra-modal Similarity (ELVIS), a VLP method aware of intra-modal locality, to better preserve locality within radiographs or reports, which enhances the ability to comprehend location references in text reports. Our locality-aware VLP method significantly outperforms state-of-the-art baselines on multiple segmentation tasks and the MS-CXR phrase grounding task. Qualitatively, we show that ELVIS focuses well on regions of interest described in the report text compared to prior approaches, allowing for enhanced interpretability.
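
The record gives no formulas, but the abstract's central idea, preserving intra-modal similarity so that local (patch- and token-level) relationships survive pre-training, can be illustrated with a small sketch. Everything below is an illustrative assumption, not the actual ELVIS objective: the function name, the round-trip KL formulation, the tensor shapes, and the temperature value are all hypothetical.

    # Hypothetical sketch of a locality-preserving intra-modal
    # similarity objective. Not taken from the ELVIS paper; the
    # round-trip formulation is an illustrative assumption.
    import torch
    import torch.nn.functional as F

    def intra_modal_similarity_loss(img_tokens: torch.Tensor,
                                    txt_tokens: torch.Tensor,
                                    temperature: float = 0.1) -> torch.Tensor:
        # img_tokens: (B, Ni, D) local image-patch embeddings
        # txt_tokens: (B, Nt, D) report-token embeddings
        img = F.normalize(img_tokens, dim=-1)
        txt = F.normalize(txt_tokens, dim=-1)

        # Intra-modal similarity: for each patch, a distribution over
        # the other patches of the same image.
        intra = torch.softmax(img @ img.transpose(1, 2) / temperature, dim=-1)  # (B, Ni, Ni)

        # Cross-modal round trip: patch -> text tokens -> patch. If the
        # text side preserves locality, this recovers the intra-modal
        # similarity structure.
        p_it = torch.softmax(img @ txt.transpose(1, 2) / temperature, dim=-1)   # (B, Ni, Nt)
        p_ti = torch.softmax(txt @ img.transpose(1, 2) / temperature, dim=-1)   # (B, Nt, Ni)
        roundtrip = p_it @ p_ti                                                 # (B, Ni, Ni)

        # Penalize divergence between the round-trip and intra-modal
        # distributions; rows of both matrices sum to 1.
        return F.kl_div(roundtrip.clamp_min(1e-8).log(), intra,
                        reduction="batchmean")

    # Toy usage with random embeddings:
    loss = intra_modal_similarity_loss(torch.randn(2, 49, 256),
                                       torch.randn(2, 32, 256))

The round-trip construction is only one plausible way to tie cross-modal alignment to intra-modal locality; the paper's own formulation may differ substantially.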

Bibliographic Details
Published in: arXiv.org, 2023-07
Main authors: Seo, Sumin; Shin, JaeWoong; Kang, Jaewoo; Kim, Tae Soo; Kooi, Thijs
Format: Article
Language: English
EISSN: 2331-8422
Subjects: Abnormalities; Annotations; Empowerment; Image segmentation; Localization; Radiographs; Similarity; Training
Online access: Full text