Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval
Composed image retrieval (CIR) aims to fuse a reference image and text feedback to search for desired images. Compared with general image retrieval, it models the user's search intent more comprehensively and retrieves target images more accurately, which has significant impact in real-world applications such as e-commerce and Internet search. However, because of the heterogeneous semantic gap between modalities, jointly understanding and fusing image and text is difficult. In this work, we propose an end-to-end framework, MCR, which uses both text and images as retrieval queries. The framework comprises four pivotal modules. Specifically, we introduce a Relative Caption-aware Consistency (RCC) constraint to align text pieces and images in the database, which effectively bridges the heterogeneous gap. The Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) modules mine the multiple interactions between image local features and text word features and learn a complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which supplements weak-text features and helps model an augmented semantic space. Extensive experiments demonstrate superior performance over existing state-of-the-art methods on several benchmarks.
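The abstract describes a composed-query pipeline: text word features and image local features are fused (MCF) and pooled (CGP) into a single query embedding used for retrieval. The sketch below illustrates that general idea with cross-attention fusion and mean pooling. It is a minimal stand-in under stated assumptions, not the paper's implementation: the class name `ComposedQueryFusion`, the dimensions, the residual connection, and the mean pooling (in place of the paper's CGP) are all hypothetical choices for illustration.

```python
# Illustrative sketch only: a generic composed-query fusion in the spirit of
# the MCF/CGP modules named in the abstract. All names and design choices
# here are assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryFusion(nn.Module):
    """Fuses image local features with text word features via cross-attention,
    then pools the result into a single composed-query embedding."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Text words attend to image regions; the reverse direction could be
        # added symmetrically for a richer interaction.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) local/region features of the reference image
        # txt_feats: (B, W, D) word features of the modification text
        fused, _ = self.cross_attn(query=txt_feats, key=img_feats, value=img_feats)
        fused = fused + txt_feats                      # residual connection
        pooled = fused.mean(dim=1)                     # mean pooling as a stand-in for CGP
        return F.normalize(self.proj(pooled), dim=-1)  # unit-norm composed embedding

# Usage: rank database images by cosine similarity between the composed query
# and unit-normalized target-image embeddings.
fusion = ComposedQueryFusion()
img = torch.randn(4, 49, 512)   # e.g., a 7x7 feature map flattened to 49 regions
txt = torch.randn(4, 12, 512)   # 12 word tokens
query = fusion(img, txt)        # (4, 512)
```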
| Published in: | IEEE Transactions on Multimedia, 2023-01, Vol. 25, pp. 1-12 |
|---|---|
| Main Authors: | Pang, Huaxin; Wei, Shikui; Zhang, Gangjian; Zhang, Shiyin; Qiu, Shuang; Zhao, Yao |
| Format: | Article |
| Language: | English |
| Subjects: | Algorithms; Composed Image Retrieval; Embedding Fusion; Feature extraction; Fuses; Image retrieval; Modules; Multi-modal learning; Searching; Semantics; Task analysis; Transformers; Visualization |
| DOI: | 10.1109/TMM.2022.3208742 |
| ISSN: | 1520-9210 |
| EISSN: | 1941-0077 |
| Publisher: | IEEE, Piscataway |
| Source: | IEEE Electronic Library (IEL) |
| Online Access: | Order full text |