HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models

Recent advancements indicate that scaling up Multimodal Large Language Models (MLLMs) effectively enhances performance on downstream multimodal tasks. The prevailing MLLM paradigm, \emph{e.g.}, LLaVA, transforms visual features into text-like tokens using a \emph{static} vision-language mapper, thereby enabling \emph{static} LLMs to develop the capability to comprehend visual information through visual instruction tuning. Although promising, the \emph{static} tuning strategy\footnote{Static tuning refers to a trained model with static parameters.}, which shares the same parameters across tasks, may constrain performance on different downstream multimodal tasks. In light of this, we introduce HyperLLaVA, which adaptively tunes the projector and LLM parameters in conjunction with a dynamic visual expert and a dynamic language expert, respectively. These experts are derived from HyperNetworks, which generate adaptive parameter shifts from visual and language guidance, enabling dynamic projector and LLM modeling in two-stage training. Our experiments demonstrate that our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.\footnote{Our project is available at https://github.com/DCDmllm/HyperLLaVA.}
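The core idea in the abstract, a HyperNetwork that turns guidance features into an adaptive shift added to otherwise static projector weights, can be sketched in a few lines. This is a minimal illustrative toy, not the paper's architecture: the shapes, the pooling, and the fixed linear hypernetwork are all assumptions for demonstration.

```python
# Toy sketch of a dynamic (hypernetwork-conditioned) projector.
# A guidance vector produces a weight shift dW; the projector then
# maps visual features with W_static + dW instead of W_static alone.
# All shapes and the hypernetwork form are illustrative assumptions.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def hypernetwork(guidance, rows, cols):
    """Map a guidance vector to a weight shift dW.
    Here a hand-fixed linear form; in the paper this is learned."""
    s = sum(guidance) / len(guidance)  # pool guidance to a scalar
    return [[0.1 * s * (i + 1) for _ in range(cols)] for i in range(rows)]

def dynamic_projector(W_static, visual_feat, guidance):
    """Project visual features with sample-adaptive weights W + dW."""
    dW = hypernetwork(guidance, len(W_static), len(W_static[0]))
    W = [[w + d for w, d in zip(rw, rd)] for rw, rd in zip(W_static, dW)]
    return matvec(W, visual_feat)

W_static = [[1.0, 0.0], [0.0, 1.0]]  # static 2x2 projector (identity)
tokens_a = dynamic_projector(W_static, [1.0, 2.0], guidance=[0.0, 0.0])
tokens_b = dynamic_projector(W_static, [1.0, 2.0], guidance=[1.0, 1.0])
print(tokens_a)  # zero guidance -> dW vanishes -> static behaviour
print(tokens_b)  # same input, nonzero guidance -> shifted output
```

The point of the sketch is the contrast between `tokens_a` and `tokens_b`: the same visual input is projected differently depending on the guidance, which is what distinguishes the dynamic expert from a static mapper.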

Detailed description

Saved in:
Bibliographic details
Published in: arXiv.org, 2024-03
Main authors: Zhang, Wenqiao; Lin, Tianwei; Liu, Jiang; Shu, Fangxun; Li, Haoyuan; Zhang, Lei; He, Wanggui; Zhou, Hao; Lv, Zheqi; Jiang, Hao; Li, Juncheng; Tang, Siliang; Zhuang, Yueting
Format: Article
Language: English
Online access: Full text
EISSN: 2331-8422
Subjects: Language; Large language models; Mathematical models; Parameters; Tuning