LLM-enhanced Composed Image Retrieval: An Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model

Bibliographic Details

Published in: ACM Transactions on Information Systems, 2025-03, Vol. 43 (2), p. 1-30
Main Authors: Ge, Hongfei; Jiang, Yuanchun; Sun, Jianshan; Yuan, Kun; Liu, Yezheng
Format: Article
Language: English
Subjects: Image search; Information systems
DOI: 10.1145/3699715
ISSN: 1046-8188
EISSN: 1558-2868
Publisher: New York, NY: ACM
Source: ACM Digital Library
Online Access: Full text
Detailed Description

Composed image retrieval (CoIR) involves a multi-modal query consisting of a reference image and a modification text describing the desired changes, allowing users to express image retrieval intents flexibly and effectively. The key to CoIR lies in properly reasoning about the search intent from the multi-modal query. Existing work either aligns the composite embedding of the multi-modal query with the target image embedding in the visual domain through late fusion, or converts all images into text descriptions and leverages large language models (LLMs) for textual semantic reasoning. However, such single-modality reasoning fails to comprehensively and interpretably capture users' ambiguous and uncertain intents in multi-modal queries, causing inconsistency between retrieved results and the ground truth. Moreover, expensive manually annotated datasets limit further performance improvements in CoIR. To this end, this paper proposes an LLM-enhanced Intent Uncertainty-aware Linguistic-Visual Dual Channel Matching Model (IUDC), which combines the strengths of multi-modal late fusion and LLMs for composed image retrieval. We first construct an LLM-based triplet augmentation strategy to generate additional synthetic training triplets. Building on this, the core of IUDC consists of two matching channels: the semantic matching channel performs intent reasoning over aspect-level attributes extracted by an LLM, while the visual matching channel handles fine-grained visual matching between the multi-modal fusion embedding and target images. To account for the intent uncertainty present in multi-modal queries, we introduce a Probability Distribution Encoder (PDE) that projects intents as probabilistic distributions in the two matching channels. A mutually enhanced module is then designed to share knowledge between the visual and semantic representations for better representation learning. Finally, the matching scores of the two channels are added to retrieve the target image. Extensive experiments on two real datasets demonstrate the effectiveness and superiority of our model. Notably, with the help of the proposed LLM-based triplet augmentation strategy, our model achieves new state-of-the-art performance on all datasets.
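To make the mechanism concrete, below is a minimal PyTorch-style sketch of two ideas from the abstract: a Probability Distribution Encoder (PDE) that maps an embedding to a Gaussian intent distribution rather than a point vector, and the final dual-channel score obtained by adding the semantic and visual matching scores. This is a reconstruction from the abstract alone, not the authors' released code; the module names, dimensions, and the sampled cosine-similarity scoring are all assumptions.

```python
# Hedged sketch (assumptions throughout): a Probability Distribution Encoder
# mapping an embedding to a Gaussian (mu, log-variance), plus dual-channel
# score fusion as described in the abstract. Not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDE(nn.Module):
    """Projects a deterministic embedding to a Gaussian intent distribution."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu_head = nn.Linear(dim, dim)      # mean of the intent distribution
        self.logvar_head = nn.Linear(dim, dim)  # log-variance (intent uncertainty)

    def forward(self, x: torch.Tensor):
        return self.mu_head(x), self.logvar_head(x)

def sample(mu, logvar, n=5):
    """Reparameterization trick: draw n samples per query from N(mu, sigma^2)."""
    std = (0.5 * logvar).exp()
    eps = torch.randn(n, *mu.shape, device=mu.device)
    return mu.unsqueeze(0) + eps * std.unsqueeze(0)  # (n, batch_q, dim)

def channel_score(query_mu, query_logvar, target_emb, n_samples=5):
    """Expected cosine similarity between sampled intents and target embeddings."""
    samples = F.normalize(sample(query_mu, query_logvar, n_samples), dim=-1)
    targets = F.normalize(target_emb, dim=-1)
    # (n, batch_q, dim) x (batch_t, dim) -> (n, batch_q, batch_t), averaged over samples
    return torch.einsum("nqd,td->nqt", samples, targets).mean(dim=0)

def retrieval_score(sem_q, sem_t, vis_q, vis_t, pde_sem, pde_vis):
    """Dual-channel score: the abstract states the two channel scores are added."""
    s_sem = channel_score(*pde_sem(sem_q), sem_t)  # semantic (LLM attribute) channel
    s_vis = channel_score(*pde_vis(vis_q), vis_t)  # visual (fused embedding) channel
    return s_sem + s_vis                           # rank candidate images by this sum
```

Under these assumptions, retrieval amounts to computing retrieval_score between one multi-modal query and every candidate image and ranking by the result; the learned variance lets an ambiguous modification text spread probability mass over several plausible targets rather than committing to a single point in embedding space.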