Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval

In this article, we study the challenging cross-modal image retrieval task, Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text query but a composed query, i.e., a reference image and a modification text. Compared with the conventional cross-modal image-text retrieval task, CQBIR is more challenging as it requires properly preserving and modifying the specific image region according to the multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting preserved and modified information and compositing it into a unified representation. However, we observe that the preserved regions learned by the existing methods contain redundant modified information, inevitably degrading the overall retrieval performance. To this end, we propose a novel method termed Cross-Modal Attention Preservation (CMAP). Specifically, we first leverage the cross-level interaction to fully account for multi-granular semantic information, which aims to supplement the high-level semantics for effective image retrieval. Furthermore, different from conventional contrastive learning, our method introduces self-contrastive learning into learning preserved information, to prevent the model from confusing the attention for the preserved part with the modified part. Extensive experiments on three widely used CQBIR datasets, i.e., FashionIQ, Shoes, and Fashion200k, demonstrate that our proposed CMAP method significantly outperforms the current state-of-the-art methods on all the datasets. The anonymous implementation code of our CMAP method is available at https://github.com/CFM-MSG/Code_CMAP.
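The central idea in the abstract is to keep the model's attention over "preserved" image regions from leaking onto the regions that the modification text asks to change, by contrasting the model's own attention maps rather than attention across different samples. As a purely illustrative aid, the sketch below shows one way such a self-contrastive objective over two attention maps could be written in PyTorch; the function name, tensor shapes, dropout-based second view, and temperature value are assumptions of this sketch and are not taken from the CMAP paper, whose actual implementation is available at the GitHub link above.

    # Illustrative sketch only: names, shapes, and the dropout-based second
    # view are assumptions, not details from the CMAP implementation.
    import torch
    import torch.nn.functional as F

    def self_contrastive_attention_loss(preserve_attn: torch.Tensor,
                                        modify_attn: torch.Tensor,
                                        temperature: float = 0.07) -> torch.Tensor:
        """Pull the 'preserved' attention toward a second view of itself while
        pushing it away from the 'modified' attention of the same sample.

        preserve_attn, modify_attn: (batch, regions) attention distributions
        produced for the same composed query.
        """
        p = F.normalize(preserve_attn.flatten(1), dim=-1)
        m = F.normalize(modify_attn.flatten(1), dim=-1)

        # Second view of the preserved attention, here simulated with dropout.
        p_aug = F.normalize(F.dropout(preserve_attn.flatten(1), p=0.1), dim=-1)

        pos = (p * p_aug).sum(-1) / temperature   # similarity to its own view
        neg = (p * m).sum(-1) / temperature       # similarity to modified part

        # InfoNCE-style loss with one positive and one in-sample negative.
        logits = torch.stack([pos, neg], dim=1)               # (batch, 2)
        labels = torch.zeros(p.size(0), dtype=torch.long,
                             device=logits.device)            # positive at index 0
        return F.cross_entropy(logits, labels)

    # Example with random attention maps over 49 regions (a 7x7 feature grid).
    preserve = torch.softmax(torch.randn(8, 49), dim=-1)
    modify = torch.softmax(torch.randn(8, 49), dim=-1)
    print(self_contrastive_attention_loss(preserve, modify))

In practice such a term would only be one auxiliary component: the main retrieval objective (matching the composed query against target images) would still drive training, and the relative weighting of the two losses would be a tunable hyperparameter.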

Bibliographic details
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-03, Vol. 20 (6), pp. 1-22, Article 165
Authors: Li, Shenshen; Xu, Xing; Jiang, Xun; Shen, Fumin; Sun, Zhe; Cichocki, Andrzej
Format: Article
Language: English
Subjects: Computing methodologies; Information retrieval; Information systems; Visual content-based indexing and retrieval
Publisher: ACM, New York, NY
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3639469
Online access: Full text