Boosting Audio-visual Zero-shot Learning with Large Language Models
Saved in:
Main authors: | Chen, Haoxing; Li, Yaohui; Hong, Yan; Huang, Zizheng; Xu, Zhuoer; Gu, Zhangxuan; Lan, Jun; Zhu, Huijia; Wang, Weiqiang |
---|---|
Format: | Article |
Language: | English |
Subjects: | Computer Science - Computer Vision and Pattern Recognition |
Online access: | Order full text |
creator | Chen, Haoxing; Li, Yaohui; Hong, Yan; Huang, Zizheng; Xu, Zhuoer; Gu, Zhangxuan; Lan, Jun; Zhu, Huijia; Wang, Weiqiang |
description | Audio-visual zero-shot learning aims to recognize unseen classes from paired audio-visual sequences. Recent methods mainly focus on learning multi-modal features aligned with class names to improve generalization to unseen categories. However, these approaches ignore the obscure event concepts in class names and often introduce complex network structures with difficult training objectives. In this paper, we introduce a straightforward yet efficient framework called KnowleDge-Augmented audio-visual learning (KDA), which helps the model learn novel event content more effectively by leveraging an external knowledge base. Specifically, we first propose to utilize the knowledge contained in large language models (LLMs) to generate numerous descriptive sentences that capture important distinguishing audio-visual features of event classes, which helps the model better understand unseen categories. Furthermore, we propose a knowledge-aware adaptive margin loss to help distinguish similar events, further improving generalization to unseen classes. Extensive experimental results demonstrate that our proposed KDA outperforms state-of-the-art methods on three popular audio-visual zero-shot learning datasets. Our code will be available at \url{https://github.com/chenhaoxing/KDA}. |
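
The abstract does not give the formulation of the knowledge-aware adaptive margin loss, so the sketch below shows only one plausible reading: per-class margins grow with the similarity between LLM-generated class descriptions, so that easily confused events must be separated by a wider gap. The class embeddings (`class_text_emb`) are assumed to come from encoding the generated descriptive sentences with a text encoder; all names, shapes, and the exact formula are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a knowledge-aware adaptive margin loss (names,
# shapes, and the formulation are assumptions for illustration only).
import torch
import torch.nn.functional as F


def adaptive_margins(class_text_emb: torch.Tensor, base_margin: float = 0.2) -> torch.Tensor:
    """Derive per-class-pair margins from LLM-generated class descriptions.

    class_text_emb: (C, D) embeddings of the generated descriptions.
    Returns a (C, C) matrix in which more similar (easily confused) class
    pairs receive larger margins.
    """
    normed = F.normalize(class_text_emb, dim=-1)
    sim = normed @ normed.T                      # cosine similarity between classes
    return base_margin * sim.clamp(min=0.0)


def knowledge_aware_margin_loss(av_emb: torch.Tensor,
                                class_text_emb: torch.Tensor,
                                labels: torch.Tensor,
                                scale: float = 10.0) -> torch.Tensor:
    """Cross-entropy over cosine logits, with margins added to negative classes.

    av_emb:         (B, D) fused audio-visual features.
    class_text_emb: (C, D) class embeddings built from the LLM descriptions.
    labels:         (B,) indices of the ground-truth (seen) classes.
    """
    logits = F.normalize(av_emb, dim=-1) @ F.normalize(class_text_emb, dim=-1).T  # (B, C)
    margins = adaptive_margins(class_text_emb)[labels]       # (B, C), one row per sample
    margins.scatter_(1, labels.unsqueeze(1), 0.0)            # no margin on the true class
    # Larger margins on similar negatives force the true class to win by a wider gap.
    return F.cross_entropy(scale * (logits + margins), labels)
```

Under this reading, the LLM descriptions serve double duty: they supply richer class semantics for alignment and they determine how hard the loss pushes apart acoustically or visually similar events.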
doi_str_mv | 10.48550/arxiv.2311.12268 |
format | Article |
identifier | DOI: 10.48550/arxiv.2311.12268 |
language | eng |
recordid | cdi_arxiv_primary_2311_12268 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | Boosting Audio-visual Zero-shot Learning with Large Language Models |
url | https://arxiv.org/abs/2311.12268 |