Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval

Composed image retrieval (CIR) fuses a reference image with text feedback to search for desired images. Compared with general image retrieval, it models the user's search intent more comprehensively and retrieves target images more accurately, which matters in real-world applications such as e-commerce and Internet search. However, the heterogeneous semantic gap between the two modalities makes the joint understanding and fusion of image and text difficult. To tackle this problem, we propose MCR, an end-to-end framework that uses text and images together as the retrieval query. The framework consists of four pivotal modules. Specifically, we introduce a Relative Caption-aware Consistency (RCC) constraint to align text pieces with images in the database, which effectively bridges the heterogeneous gap. Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) mine the multiple interactions between local image features and text word features and learn a complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts; it supplements the weak-text features and helps model an augmented semantic space. Extensive experiments on several benchmarks demonstrate superior performance over existing state-of-the-art methods.
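As a rough illustration of the kind of composed-query fusion the abstract describes, the sketch below combines local image features and text word features into a single query embedding using cross-modal attention and a learned gate. The class name, dimensions, and the use of standard multi-head attention are assumptions made for this example only; it is not the paper's MCR, MCF, or CGP implementation.

```python
# Minimal, hypothetical sketch of composed-query fusion for CIR:
# text word features guide attention over image region features, and the
# pooled result is gated into the global image feature to form the query.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Text words attend to image regions (cross-modal interaction).
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate decides how much textual modification is injected.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, img_regions: torch.Tensor, txt_words: torch.Tensor) -> torch.Tensor:
        # img_regions: (B, R, dim) local image features; txt_words: (B, W, dim) word features.
        attended, _ = self.cross_attn(query=txt_words, key=img_regions, value=img_regions)
        txt_ctx = attended.mean(dim=1)         # (B, dim) text-guided visual context
        img_global = img_regions.mean(dim=1)   # (B, dim) global image feature
        both = torch.cat([img_global, txt_ctx], dim=-1)
        g = self.gate(both)                    # element-wise gate in [0, 1]
        fused = g * img_global + (1.0 - g) * self.proj(both)
        return F.normalize(fused, dim=-1)      # unit-norm composed-query embedding

# Usage: embed reference-image regions and modification-text words with any backbone,
# project them to a shared dimension, fuse, then rank database images by cosine similarity.
fusion = ComposedQueryFusion(dim=512)
img = torch.randn(4, 49, 512)   # e.g. a 7x7 feature map flattened into 49 regions
txt = torch.randn(4, 12, 512)   # 12 word tokens projected to the same dimension
query_emb = fusion(img, txt)    # (4, 512)
```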

Bibliographic Details

Published in: IEEE Transactions on Multimedia, 2023-01, Vol. 25, pp. 1-12
Authors: Pang, Huaxin; Wei, Shikui; Zhang, Gangjian; Zhang, Shiyin; Qiu, Shuang; Zhao, Yao
Format: Article
Language: English
Publisher: IEEE (Piscataway)
Online access: Order full text
DOI: 10.1109/TMM.2022.3208742
ISSN: 1520-9210
EISSN: 1941-0077
Source: IEEE Electronic Library (IEL)
Subjects:
Algorithms
Composed Image Retrieval
Embedding Fusion
Feature extraction
Fuses
Image retrieval
Modules
Multi-modal learning
Searching
Semantics
Task analysis
Transformers
Visualization