Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Bibliographic Details
Published in: arXiv.org, 2024-11
Main authors: Yan, Xudong; Feng, Songhe; Zhang, Yang; Yang, Jian; Lin, Yueguan; Fei, Haojun
Format: Article
Language: English
Online Access: Full text
Description: Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting the shared and exclusive parts of image pairs that share the same attribute (object), and by aligning them with pretrained word embeddings to improve recognition of unseen attribute-object pairs. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) the efficacy of disentanglement is compromised by the influence of the background and by the intricate entanglement of attribute and object within the same parts; (2) existing word embeddings fail to capture complex multimodal semantic information; (3) the overconfidence existing models exhibit on seen compositions hinders their generalization to novel compositions. Aware of these issues, we propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of the background, and learnable condition masks to capture multi-granularity features for disentanglement. Then, the last hidden states of an MLLM are employed as word embeddings for their superior representational capabilities. Moreover, we propose attribute smoothing with auxiliary attributes generated by a Large Language Model (LLM) for seen compositions, addressing overconfidence by encouraging the model to learn additional attributes of a given composition. Extensive experiments demonstrate that TRIDENT achieves state-of-the-art performance on three benchmarks.
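The attribute-smoothing idea from the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, attribute indices, and the smoothing weight `alpha` are all hypothetical, and the auxiliary attributes stand in for ones an LLM might suggest for a seen composition.

```python
import numpy as np

def smooth_attribute_target(num_attrs, true_attr, auxiliary_attrs, alpha=0.2):
    """Build a smoothed target distribution over attribute classes.

    Instead of a one-hot target for the annotated attribute, probability
    mass `alpha` is spread over auxiliary attributes (e.g. ones an LLM
    suggests also apply to the composition), discouraging overconfident
    predictions on seen compositions.
    """
    target = np.zeros(num_attrs)
    target[true_attr] = 1.0 - alpha
    for a in auxiliary_attrs:
        target[a] = alpha / len(auxiliary_attrs)
    return target

# Hypothetical example: 5 attribute classes, annotated attribute at index 2
# (say "sliced"), with the LLM also suggesting indices 0 and 4
# (say "fresh" and "peeled") for a "sliced apple" composition.
t = smooth_attribute_target(5, true_attr=2, auxiliary_attrs=[0, 4], alpha=0.2)
print(t)  # [0.1 0.  0.8 0.  0.1]
```

Training against such a soft target (e.g. with a cross-entropy loss over probability distributions) is what lets the model assign credible mass to more than one attribute of a given composition.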
EISSN: 2331-8422
Record ID: cdi_proquest_journals_3130968274
Source: Freely Accessible Journals
Subjects: Composition; Entanglement; Large language models; Object recognition; Smoothing; Words (language); Zero-shot learning