Heterogeneous Feature Alignment and Fusion in Cross-Modal Augmented Space for Composed Image Retrieval
Composed image retrieval (CIR) aims to fuse a reference image and text feedback to search for desired images. Compared with general image retrieval, it models the user's search intent more comprehensively and retrieves target images more accurately, which has significant impact in real-world applications such as e-commerce and Internet search. However, because of the heterogeneous semantic gap between modalities, jointly understanding and fusing image and text is difficult. In this work, we propose an end-to-end framework, MCR, which uses both text and images as retrieval queries. The framework comprises four pivotal modules. Specifically, we introduce a Relative Caption-aware Consistency (RCC) constraint to align text pieces and images in the database, which effectively bridges the heterogeneous gap. The Multi-modal Complementary Fusion (MCF) and Cross-modal Guided Pooling (CGP) modules mine the multiple interactions between image local features and text word features and learn a complementary representation of the composed query. Furthermore, we develop a plug-and-play Weak-text Semantic Augment (WSA) module for datasets with short or incomplete query texts, which supplements weak-text features and helps model an augmented semantic space. Extensive experiments demonstrate superior performance over existing state-of-the-art methods on several benchmarks.
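The abstract describes a composed-query pipeline: text word features and image local features are fused (MCF) and pooled (CGP) into a single query embedding used for retrieval. The sketch below illustrates that general idea with cross-attention fusion and mean pooling. It is a minimal stand-in under stated assumptions, not the paper's implementation: the class name `ComposedQueryFusion`, the dimensions, the residual connection, and the mean pooling (in place of the paper's CGP) are all hypothetical choices for illustration.

```python
# Illustrative sketch only: a generic composed-query fusion in the spirit of
# the MCF/CGP modules named in the abstract. All names and design choices
# here are assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryFusion(nn.Module):
    """Fuses image local features with text word features via cross-attention,
    then pools the result into a single composed-query embedding."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        # Text words attend to image regions; the reverse direction could be
        # added symmetrically for a richer interaction.
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # img_feats: (B, R, D) local/region features of the reference image
        # txt_feats: (B, W, D) word features of the modification text
        fused, _ = self.cross_attn(query=txt_feats, key=img_feats, value=img_feats)
        fused = fused + txt_feats                      # residual connection
        pooled = fused.mean(dim=1)                     # mean pooling as a stand-in for CGP
        return F.normalize(self.proj(pooled), dim=-1)  # unit-norm composed embedding

# Usage: rank database images by cosine similarity between the composed query
# and unit-normalized target-image embeddings.
fusion = ComposedQueryFusion()
img = torch.randn(4, 49, 512)   # e.g., a 7x7 feature map flattened to 49 regions
txt = torch.randn(4, 12, 512)   # 12 word tokens
query = fusion(img, txt)        # (4, 512)
```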
| Published in: | IEEE Transactions on Multimedia, 2023-01, Vol. 25, pp. 1-12 |
|---|---|
| Main Authors: | Pang, Huaxin; Wei, Shikui; Zhang, Gangjian; Zhang, Shiyin; Qiu, Shuang; Zhao, Yao |
| Format: | Article |
| Language: | English |
| Subjects: | Algorithms; Composed Image Retrieval; Embedding Fusion; Feature extraction; Fuses; Image retrieval; Modules; Multi-modal learning; Searching; Semantics; Task analysis; Transformers; Visualization |
| DOI: | 10.1109/TMM.2022.3208742 |
| ISSN: | 1520-9210 |
| EISSN: | 1941-0077 |
| Publisher: | IEEE, Piscataway |
| Source: | IEEE Electronic Library (IEL) |
| Online Access: | Order full text |