Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval

In this article, we study the challenging cross-modal image retrieval task, Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text query but a composed query, i.e., a reference image and a modification text. Compared with the conventional cross-modal image-text retrieval task, CQBIR is more challenging as it requires properly preserving and modifying the specific image region according to the multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting preserved and modified information and compositing it into a unified representation. However, we observe that the preserved regions learned by the existing methods contain redundant modified information, inevitably degrading the overall retrieval performance. To this end, we propose a novel method termed Cross-Modal Attention Preservation (CMAP). Specifically, we first leverage the cross-level interaction to fully account for multi-granular semantic information, which aims to supplement the high-level semantics for effective image retrieval. Furthermore, different from conventional contrastive learning, our method introduces self-contrastive learning into learning preserved information, to prevent the model from confusing the attention for the preserved part with the modified part. Extensive experiments on three widely used CQBIR datasets, i.e., FashionIQ, Shoes, and Fashion200k, demonstrate that our proposed CMAP method significantly outperforms the current state-of-the-art methods on all the datasets. The anonymous implementation code of our CMAP method is available at https://github.com/CFM-MSG/Code_CMAP.
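The central idea in the abstract is to keep the model's attention over "preserved" image regions from leaking onto the regions that the modification text asks to change, by contrasting the model's own attention maps rather than attention across different samples. As a purely illustrative aid, the sketch below shows one way such a self-contrastive objective over two attention maps could be written in PyTorch; the function name, tensor shapes, dropout-based second view, and temperature value are assumptions of this sketch and are not taken from the CMAP paper, whose actual implementation is available at the GitHub link above.

    # Illustrative sketch only: names, shapes, and the dropout-based second
    # view are assumptions, not details from the CMAP implementation.
    import torch
    import torch.nn.functional as F

    def self_contrastive_attention_loss(preserve_attn: torch.Tensor,
                                        modify_attn: torch.Tensor,
                                        temperature: float = 0.07) -> torch.Tensor:
        """Pull the 'preserved' attention toward a second view of itself while
        pushing it away from the 'modified' attention of the same sample.

        preserve_attn, modify_attn: (batch, regions) attention distributions
        produced for the same composed query.
        """
        p = F.normalize(preserve_attn.flatten(1), dim=-1)
        m = F.normalize(modify_attn.flatten(1), dim=-1)

        # Second view of the preserved attention, here simulated with dropout.
        p_aug = F.normalize(F.dropout(preserve_attn.flatten(1), p=0.1), dim=-1)

        pos = (p * p_aug).sum(-1) / temperature   # similarity to its own view
        neg = (p * m).sum(-1) / temperature       # similarity to modified part

        # InfoNCE-style loss with one positive and one in-sample negative.
        logits = torch.stack([pos, neg], dim=1)               # (batch, 2)
        labels = torch.zeros(p.size(0), dtype=torch.long,
                             device=logits.device)            # positive at index 0
        return F.cross_entropy(logits, labels)

    # Example with random attention maps over 49 regions (a 7x7 feature grid).
    preserve = torch.softmax(torch.randn(8, 49), dim=-1)
    modify = torch.softmax(torch.randn(8, 49), dim=-1)
    print(self_contrastive_attention_loss(preserve, modify))

In practice such a term would only be one auxiliary component: the main retrieval objective (matching the composed query against target images) would still drive training, and the relative weighting of the two losses would be a tunable hyperparameter.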

Bibliographic details
Published in: ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-03, Vol. 20 (6), pp. 1-22, Article 165
Authors: Li, Shenshen; Xu, Xing; Jiang, Xun; Shen, Fumin; Sun, Zhe; Cichocki, Andrzej
Format: Article
Language: English
Subjects: Computing methodologies; Information retrieval; Information systems; Visual content-based indexing and retrieval
Publisher: ACM, New York, NY
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3639469
Online access: Full text