Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval
In this article, we study the challenging cross-modal image retrieval task of Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text query but a composed query, i.e., a reference image and a modification text. Compared with the conventional cross-modal image-text retrieval task, CQBIR is more challenging, as it requires properly preserving and modifying specific image regions according to the multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting preserved and modified information and compositing it into a unified representation. However, we observe that the preserved regions learned by existing methods contain redundant modified information, which inevitably degrades overall retrieval performance. To this end, we propose a novel method termed Cross-Modal Attention Preservation (CMAP). Specifically, we first leverage cross-level interaction to fully account for multi-granular semantic information, aiming to supplement the high-level semantics needed for effective image retrieval. Furthermore, unlike conventional contrastive learning, our method introduces self-contrastive learning into learning the preserved information, to prevent the model from confusing the attention for the preserved part with that for the modified part. Extensive experiments on three widely used CQBIR datasets, i.e., FashionIQ, Shoes, and Fashion200k, demonstrate that our proposed CMAP method significantly outperforms the current state-of-the-art methods on all datasets. The implementation code of our CMAP method is available at https://github.com/CFM-MSG/Code_CMAP.
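To make the CQBIR setup concrete, the following is a minimal, hypothetical sketch of the task described in the abstract: a reference image and a modification text are fused into a single query embedding, which is then matched against a gallery of candidate target images. This is not the paper's CMAP architecture; the encoders, the fusion MLP, and all names here are placeholder assumptions.

```python
# Hypothetical sketch of composed-query retrieval (not the CMAP model itself).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComposedQueryRetriever(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)   # stands in for an image encoder
        self.txt_proj = nn.Linear(txt_dim, embed_dim)   # stands in for a text encoder
        self.fuse = nn.Sequential(                      # composes image + text into one query
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def compose(self, ref_img_feat, mod_text_feat):
        # Concatenate the projected reference image and modification text, then fuse.
        q = torch.cat([self.img_proj(ref_img_feat), self.txt_proj(mod_text_feat)], dim=-1)
        return F.normalize(self.fuse(q), dim=-1)

    def rank(self, ref_img_feat, mod_text_feat, gallery_img_feats):
        query = self.compose(ref_img_feat, mod_text_feat)                 # (B, D)
        gallery = F.normalize(self.img_proj(gallery_img_feats), dim=-1)   # (N, D)
        scores = query @ gallery.t()                                      # cosine similarities
        return scores.argsort(dim=-1, descending=True)                    # ranked gallery indices

# Usage with random placeholder features:
model = ComposedQueryRetriever()
ranks = model.rank(torch.randn(4, 512), torch.randn(4, 512), torch.randn(100, 512))
print(ranks.shape)  # torch.Size([4, 100])
```

Retrieval quality then reduces to how well the composed query embedding preserves the unchanged content of the reference image while applying the requested modification, which is exactly the failure mode the abstract says CMAP addresses.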
| Published in: | ACM Transactions on Multimedia Computing, Communications and Applications, 2024-03, Vol. 20 (6), pp. 1-22, Article 165 |
|---|---|
| Main authors: | Li, Shenshen; Xu, Xing; Jiang, Xun; Shen, Fumin; Sun, Zhe; Cichocki, Andrzej |
| Format: | Article |
| Language: | English |
| Subjects: | Computing methodologies; Information retrieval; Information systems; Visual content-based indexing and retrieval |
| DOI: | 10.1145/3639469 |
| ISSN: | 1551-6857 (print); 1551-6865 (electronic) |
| Publisher: | ACM, New York, NY |
| Online access: | Full text |
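The abstract also distinguishes the proposed self-contrastive learning from conventional contrastive learning: the preserved representation should stay separable from the modified one. As a rough, hypothetical illustration of that general idea only (not the loss defined in the CMAP paper; the tensor names and the exact formulation are assumptions), an InfoNCE-style objective could treat the target image as the positive for the preserved stream while using the same sample's modified stream and other batch targets as negatives.

```python
# Hypothetical contrastive-style objective separating "preserved" and "modified" streams.
import torch
import torch.nn.functional as F

def preserved_vs_modified_loss(preserved, modified, target, temperature=0.07):
    """For each sample, the target image embedding is the positive for the preserved
    embedding; the same sample's modified embedding and other targets in the batch
    serve as negatives. All inputs are (B, D) feature tensors."""
    preserved = F.normalize(preserved, dim=-1)
    modified = F.normalize(modified, dim=-1)
    target = F.normalize(target, dim=-1)

    pos = (preserved * target).sum(-1, keepdim=True)          # (B, 1) positive similarities
    neg_self = (preserved * modified).sum(-1, keepdim=True)   # (B, 1) own modified stream
    neg_batch = preserved @ target.t()                        # (B, B) cross-batch targets
    # Mask the diagonal: those entries are the positives, already counted in `pos`.
    neg_batch = neg_batch.masked_fill(torch.eye(len(preserved), dtype=torch.bool), float('-inf'))

    logits = torch.cat([pos, neg_self, neg_batch], dim=1) / temperature
    labels = torch.zeros(len(preserved), dtype=torch.long)    # the positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = preserved_vs_modified_loss(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```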