Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning

Bibliographic Details
Published in: arXiv.org, 2024-11
Main authors: Yan, Xudong; Feng, Songhe; Zhang, Yang; Yang, Jian; Lin, Yueguan; Fei, Haojun
Format: Article
Language: English
Online Access: Full text
Description: Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting the shared and exclusive parts of image pairs that share the same attribute (object), and by aligning them with pretrained word embeddings to improve recognition of unseen attribute-object pairs. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) the efficacy of disentanglement is compromised by the influence of the background and by the intricate entanglement of attribute and object within the same parts; (2) existing word embeddings fail to capture complex multimodal semantic information; (3) the overconfidence existing models exhibit on seen compositions hinders their generalization to novel compositions. Aware of these issues, we propose a novel framework named Multimodal Large Language Model (MLLM) embeddings and attribute smoothing guided disentanglement (TRIDENT) for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of the background, and learnable condition masks to capture multi-granularity features for disentanglement. Then, the last hidden states of an MLLM are employed as word embeddings for their superior representational capabilities. Moreover, we propose attribute smoothing with auxiliary attributes generated by a Large Language Model (LLM) for seen compositions, addressing overconfidence by encouraging the model to learn additional attributes of a given composition. Extensive experiments demonstrate that TRIDENT achieves state-of-the-art performance on three benchmarks.
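The attribute-smoothing idea from the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, attribute indices, and the smoothing weight `alpha` are all hypothetical, and the auxiliary attributes stand in for ones an LLM might suggest for a seen composition.

```python
import numpy as np

def smooth_attribute_target(num_attrs, true_attr, auxiliary_attrs, alpha=0.2):
    """Build a smoothed target distribution over attribute classes.

    Instead of a one-hot target for the annotated attribute, probability
    mass `alpha` is spread over auxiliary attributes (e.g. ones an LLM
    suggests also apply to the composition), discouraging overconfident
    predictions on seen compositions.
    """
    target = np.zeros(num_attrs)
    target[true_attr] = 1.0 - alpha
    for a in auxiliary_attrs:
        target[a] = alpha / len(auxiliary_attrs)
    return target

# Hypothetical example: 5 attribute classes, annotated attribute at index 2
# (say "sliced"), with the LLM also suggesting indices 0 and 4
# (say "fresh" and "peeled") for a "sliced apple" composition.
t = smooth_attribute_target(5, true_attr=2, auxiliary_attrs=[0, 4], alpha=0.2)
print(t)  # [0.1 0.  0.8 0.  0.1]
```

Training against such a soft target (e.g. with a cross-entropy loss over probability distributions) is what lets the model assign credible mass to more than one attribute of a given composition.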
EISSN: 2331-8422
Record ID: cdi_proquest_journals_3130968274
Source: Freely Accessible Journals
Subjects: Composition; Entanglement; Large language models; Object recognition; Smoothing; Words (language); Zero-shot learning