LLM-enhanced Composed Image Retrieval: An Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model

Bibliographic Details

Published in: ACM Transactions on Information Systems, 2025-03, Vol. 43 (2), p. 1-30
Main Authors: Ge, Hongfei; Jiang, Yuanchun; Sun, Jianshan; Yuan, Kun; Liu, Yezheng
Format: Article
Language: English
Subjects: Image search; Information systems
DOI: 10.1145/3699715
ISSN: 1046-8188
EISSN: 1558-2868
Publisher: New York, NY: ACM
Source: ACM Digital Library
Online Access: Full text
Detailed Description

Composed image retrieval (CoIR) involves a multi-modal query consisting of a reference image and a modification text describing the desired changes, allowing users to express image retrieval intents flexibly and effectively. The key to CoIR lies in properly reasoning about the search intent from the multi-modal query. Existing work either aligns the composite embedding of the multi-modal query with the target image embedding in the visual domain through late fusion, or converts all images into text descriptions and leverages large language models (LLMs) for textual semantic reasoning. However, such single-modality reasoning fails to comprehensively and interpretably capture users' ambiguous and uncertain intents in multi-modal queries, causing inconsistency between retrieved results and the ground truth. Moreover, expensive manually annotated datasets limit further performance improvements in CoIR. To this end, this paper proposes an LLM-enhanced Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model (IUDC), which combines the strengths of multi-modal late fusion and LLMs for composed image retrieval. We first construct an LLM-based triplet augmentation strategy to generate additional synthetic training triplets. Building on this, the core of IUDC consists of two matching channels: the semantic matching channel performs intent reasoning over aspect-level attributes extracted by an LLM, while the visual matching channel handles fine-grained visual matching between the multi-modal fusion embedding and target images. To account for the intent uncertainty present in multi-modal queries, we introduce a Probability Distribution Encoder (PDE) that projects intents as probabilistic distributions in the two matching channels. A mutually enhanced module is then designed to share knowledge between the visual and semantic representations for better representation learning. Finally, the matching scores of the two channels are added to retrieve the target image. Extensive experiments on two real datasets demonstrate the effectiveness and superiority of our model. Notably, with the help of the proposed LLM-based triplet augmentation strategy, our model achieves new state-of-the-art performance on all datasets.
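To make the mechanism concrete, below is a minimal PyTorch-style sketch of two ideas from the abstract: a Probability Distribution Encoder (PDE) that maps an embedding to a Gaussian intent distribution rather than a point vector, and the final dual-channel score obtained by adding the semantic and visual matching scores. This is a reconstruction from the abstract alone, not the authors' released code; the module names, dimensions, and the sampled cosine-similarity scoring are all assumptions.

```python
# Hedged sketch (assumptions throughout): a Probability Distribution Encoder
# mapping an embedding to a Gaussian (mu, log-variance), plus dual-channel
# score fusion as described in the abstract. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDE(nn.Module):
    """Projects a deterministic embedding to a Gaussian intent distribution."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu_head = nn.Linear(dim, dim)      # mean of the intent distribution
        self.logvar_head = nn.Linear(dim, dim)  # log-variance (intent uncertainty)

    def forward(self, x: torch.Tensor):
        return self.mu_head(x), self.logvar_head(x)

def sample(mu, logvar, n=5):
    """Reparameterization trick: draw n samples per query from N(mu, sigma^2)."""
    std = (0.5 * logvar).exp()
    eps = torch.randn(n, *mu.shape, device=mu.device)
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)  # (n, batch_q, dim)

def channel_score(query_mu, query_logvar, target_emb, n_samples=5):
    """Expected cosine similarity between sampled intents and target embeddings."""
    samples = F.normalize(sample(query_mu, query_logvar, n_samples), dim=-1)
    targets = F.normalize(target_emb, dim=-1)
    # (n, batch_q, dim) x (batch_t, dim) -> (n, batch_q, batch_t), averaged over samples
    return torch.einsum("nqd,td->nqt", samples, targets).mean(dim=0)

def retrieval_score(sem_q, sem_t, vis_q, vis_t, pde_sem, pde_vis):
    """Dual-channel score: the abstract states the two channel scores are added."""
    s_sem = channel_score(*pde_sem(sem_q), sem_t)  # semantic (LLM attribute) channel
    s_vis = channel_score(*pde_vis(vis_q), vis_t)  # visual (fused embedding) channel
    return s_sem + s_vis                           # rank candidate images by this sum
```

Under these assumptions, retrieval amounts to computing retrieval_score between one multi-modal query and every candidate image and ranking by the result; the learned variance lets an ambiguous modification text spread probability mass over several plausible targets rather than committing to a single point in embedding space.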