MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of `weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
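The abstract describes an architecture in which a router selectively activates a few lightweight "weak" encoders to supplement a fixed base encoder. The following is a minimal, hypothetical Python sketch of that routing idea (plain Python, no ML framework); the class name, the dot-product scoring, and the top-k activation rule are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of MoWE-style routing (not the paper's code): a router
# scores each lightweight "weak" encoder against the input and activates only
# the top-k, whose features supplement a fixed base encoder's output.

def dot(u, v):
    """Plain dot product over two equal-length vectors (lists of floats)."""
    return sum(a * b for a, b in zip(u, v))

class MoWELayer:
    def __init__(self, base_encoder, weak_encoders, router_weights, top_k=2):
        self.base = base_encoder      # large pre-trained encoder (kept fixed)
        self.weak = weak_encoders     # pool of lightweight encoders
        self.router = router_weights  # one scoring vector per weak encoder
        self.top_k = top_k            # weak encoders activated per input

    def __call__(self, x):
        # Score every weak encoder on this input; keep only the top-k indices.
        scores = [dot(w, x) for w in self.router]
        chosen = sorted(range(len(self.weak)), key=lambda i: -scores[i])[: self.top_k]
        # Base features plus the sum of the selected weak-encoder features.
        out = self.base(x)
        for i in chosen:
            out = [o + e for o, e in zip(out, self.weak[i](x))]
        return out

# Toy usage with stand-in "encoders" (simple scaling functions).
base = lambda x: [2.0 * v for v in x]
weak = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
router = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer = MoWELayer(base, weak, router, top_k=2)
print(layer([1.0, 2.0]))  # encoders 2 and 1 are routed: [7.0, 14.0]
```

Because only `top_k` of the weak encoders run per input, the pool can grow without a proportional increase in per-input compute, which matches the abstract's claim of enhancing feature extraction without significantly increasing effective model size.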

Bibliographic Details
Main authors: Zhang, Wenyu; Sun, Shuo; Wang, Bin; Zou, Xunlong; Liu, Zhuohan; He, Yingxu; Lin, Geyu; Chen, Nancy F; Aw, Ai Ti
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Sound
Online access: Full text via arXiv.org
DOI: 10.48550/arxiv.2409.06635
Date: 2024-09-10