MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
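The abstract describes an architecture in which a router selectively activates a few lightweight "weak" encoders to supplement a fixed base encoder. The following is a minimal, hypothetical Python sketch of that routing idea (plain Python, no ML framework); the class name, the dot-product scoring, and the top-k activation rule are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of MoWE-style routing (not the paper's code): a router
# scores each lightweight "weak" encoder against the input and activates only
# the top-k, whose features supplement a fixed base encoder's output.

def dot(u, v):
    """Plain dot product over two equal-length vectors (lists of floats)."""
    return sum(a * b for a, b in zip(u, v))

class MoWELayer:
    def __init__(self, base_encoder, weak_encoders, router_weights, top_k=2):
        self.base = base_encoder      # large pre-trained encoder (kept fixed)
        self.weak = weak_encoders     # pool of lightweight encoders
        self.router = router_weights  # one scoring vector per weak encoder
        self.top_k = top_k            # weak encoders activated per input

    def __call__(self, x):
        # Score every weak encoder on this input; keep only the top-k indices.
        scores = [dot(w, x) for w in self.router]
        chosen = sorted(range(len(self.weak)), key=lambda i: -scores[i])[: self.top_k]
        # Base features plus the sum of the selected weak-encoder features.
        out = self.base(x)
        for i in chosen:
            out = [o + e for o, e in zip(out, self.weak[i](x))]
        return out

# Toy usage with stand-in "encoders" (simple scaling functions).
base = lambda x: [2.0 * v for v in x]
weak = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
router = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer = MoWELayer(base, weak, router, top_k=2)
print(layer([1.0, 2.0]))  # encoders 2 and 1 are routed: [7.0, 14.0]
```

Because only `top_k` of the weak encoders run per input, the pool can grow without a proportional increase in per-input compute, which matches the abstract's claim of enhancing feature extraction without significantly increasing effective model size.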

Bibliographic Details
Main authors: Zhang, Wenyu; Sun, Shuo; Wang, Bin; Zou, Xunlong; Liu, Zhuohan; He, Yingxu; Lin, Geyu; Chen, Nancy F; Aw, Ai Ti
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Sound
Online access: Full text via arXiv.org
DOI: 10.48550/arxiv.2409.06635
Date: 2024-09-10