Remote Sensing Image-Text Retrieval With Implicit-Explicit Relation Reasoning
Remote sensing image-text retrieval (RSITR) has become a research hotspot in recent years owing to its wide range of applications. Existing methods, based on either local or global feature matching, overlook the visual deviation induced by sensing variation and the mismatch between geographically nearby image-text pairs in remote sensing (RS) images...
Saved in:
Published in: | IEEE transactions on geoscience and remote sensing 2024, Vol.62, p.1-11 |
---|---|
Main authors: | Yang, Lingling; Zhou, Tongqing; Ma, Wentao; Du, Mengze; Liu, Lu; Li, Feng; Zhao, Shan; Wang, Yuwei |
Format: | Article |
Language: | eng |
Keywords: | Implicit relation reasoning; masked image modeling (MIM); masked language modeling (MLM); remote sensing image-text retrieval (RSITR) |
Online access: | Order full text |
container_end_page | 11 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE transactions on geoscience and remote sensing |
container_volume | 62 |
creator | Yang, Lingling; Zhou, Tongqing; Ma, Wentao; Du, Mengze; Liu, Lu; Li, Feng; Zhao, Shan; Wang, Yuwei |
description | Remote sensing image-text retrieval (RSITR) has become a research hotspot in recent years owing to its wide range of applications. Existing methods, based on either local or global feature matching, overlook two problems of remote sensing (RS) images: visual deviation induced by sensing variation, and mismatches between geographically nearby image-text pairs. Both limit retrieval accuracy for RSITR. To address this, we present IERR, an implicit-explicit relation reasoning framework that learns relations between local visual and textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, masked image modeling (MIM) and masked language modeling (MLM) are used for symmetric mask reasoning consistency alignment. Meanwhile, masked features (i.e., the implicit relation) and unmasked features (i.e., the explicit relation) are fed into a multimodal interaction encoder to enhance the representations of the textual-visual features. Extensive experiments on the RSICD and RSITMD datasets demonstrate the superiority of IERR over 17 baselines. |
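The symmetric masking and consistency-alignment idea in the abstract can be sketched in a toy form. This is purely illustrative: the mean-pooling stand-in, the masking ratio, and all names below are assumptions, not the paper's actual architecture (which uses a multimodal interaction encoder rather than pooling).

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(tokens, ratio=0.4):
    """Zero out a random subset of token features (MIM/MLM-style masking)."""
    n = tokens.shape[0]
    idx = rng.choice(n, size=int(n * ratio), replace=False)
    masked = tokens.copy()
    masked[idx] = 0.0
    return masked, idx

def cosine(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pool(x):
    """Stand-in for a learned global feature: mean over tokens."""
    return x.mean(axis=0)

# Toy "visual" and "textual" token features (rows = tokens, cols = dims).
img_tokens = rng.normal(size=(16, 8))
txt_tokens = rng.normal(size=(12, 8))

# Symmetric masking: each modality is masked independently.
img_masked, _ = random_mask(img_tokens)
txt_masked, _ = random_mask(txt_tokens)

# Consistency alignment: the cross-modal similarity of the masked
# (implicit) views should agree with that of the unmasked (explicit) views.
explicit_sim = cosine(pool(img_tokens), pool(txt_tokens))
implicit_sim = cosine(pool(img_masked), pool(txt_masked))
consistency_loss = (explicit_sim - implicit_sim) ** 2
```

Minimizing a loss of this shape during training would push the model to infer the same cross-modal relation from partial (masked) evidence as from full evidence, which is the intuition behind treating masked features as implicit relations.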
doi_str_mv | 10.1109/TGRS.2024.3466909 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 0196-2892 |
ispartof | IEEE transactions on geoscience and remote sensing, 2024, Vol.62, p.1-11 |
issn | 0196-2892 1558-0644 |
language | eng |
recordid | cdi_ieee_primary_10689651 |
source | IEEE Electronic Library (IEL) |
subjects | Cognition; Electronic mail; Feature extraction; Implicit relation reasoning; masked image modeling (MIM); masked language modeling (MLM); Remote sensing; remote sensing image-text retrieval (RSITR); Semantics; Sensors; Visualization |
title | Remote Sensing Image-Text Retrieval With Implicit-Explicit Relation Reasoning |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-23T16%3A57%3A58IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-crossref_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Remote%20Sensing%20Image-Text%20Retrieval%20With%20Implicit-Explicit%20Relation%20Reasoning&rft.jtitle=IEEE%20transactions%20on%20geoscience%20and%20remote%20sensing&rft.au=Yang,%20Lingling&rft.date=2024&rft.volume=62&rft.spage=1&rft.epage=11&rft.pages=1-11&rft.issn=0196-2892&rft.eissn=1558-0644&rft.coden=IGRSD2&rft_id=info:doi/10.1109/TGRS.2024.3466909&rft_dat=%3Ccrossref_RIE%3E10_1109_TGRS_2024_3466909%3C/crossref_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rft_ieee_id=10689651&rfr_iscdi=true |