MoVA: Adapting Mixture of Vision Experts to Multimodal Context

As the key component of multimodal large language models (MLLMs), the visual encoder largely determines an MLLM's understanding of diverse image content. Although large-scale pretrained vision encoders such as those in CLIP and DINOv2 deliver promising performance, we find that no single vision encoder dominates across all types of image content: the CLIP vision encoder, for example, yields outstanding results on general image understanding but performs poorly on document or chart content. To alleviate the bias of the CLIP vision encoder, we first analyze the inherent behavior of different pre-trained vision encoders and then propose MoVA, a powerful and novel MLLM that adaptively routes and fuses task-specific vision experts with a coarse-to-fine mechanism. In the coarse-grained stage, a context-aware expert routing strategy dynamically selects the most suitable vision experts according to the user instruction, the input image, and the expertise of each vision expert, exploiting the strong model-function understanding ability of the large language model (LLM). In the fine-grained stage, the mixture-of-vision-expert adapter (MoV-Adapter) extracts and fuses task-specific knowledge from the selected experts. This coarse-to-fine paradigm effectively leverages expert representations based on the multimodal context and model expertise, further enhancing generalization. Extensive experiments confirm the effectiveness of the approach: without bells and whistles, MoVA achieves significant gains over state-of-the-art methods on a wide range of challenging multimodal benchmarks.
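The coarse-to-fine mechanism described in the abstract can be pictured with a minimal Python/PyTorch sketch: an LLM-driven router first selects a subset of vision experts from textual descriptions of their expertise, and an adapter then softly fuses the selected experts' features into the base visual tokens. All names here (route_experts, llm_choose, MoVAdapterSketch) and the soft-gating design are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the coarse-to-fine routing/fusion idea from the abstract.
# Class names, the soft-gating design, and the text-based routing step are
# illustrative assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


def route_experts(llm_choose, instruction, expert_cards):
    """Coarse-grained stage: ask the LLM which experts suit the instruction.

    llm_choose is a hypothetical callable returning a set of expert names,
    given the instruction and a short description ("card") of each expert.
    """
    chosen = llm_choose(instruction, expert_cards)
    return torch.tensor([name in chosen for name in expert_cards])  # bool mask


class MoVAdapterSketch(nn.Module):
    """Fine-grained stage: fuse features from the routed vision experts."""

    def __init__(self, expert_dims, hidden_dim):
        super().__init__()
        # Project each expert's features into a shared hidden space.
        self.proj = nn.ModuleList([nn.Linear(d, hidden_dim) for d in expert_dims])
        # Soft gate over experts, conditioned on pooled base-encoder context.
        self.gate = nn.Linear(hidden_dim, len(expert_dims))

    def forward(self, base_tokens, expert_feats, active):
        # base_tokens:  (B, N, hidden_dim) tokens from the base vision encoder
        # expert_feats: list of (B, N, d_i) feature maps, one per expert
        # active:       (num_experts,) bool mask from the coarse-grained router
        projected = [p(f) for p, f in zip(self.proj, expert_feats)]
        context = base_tokens.mean(dim=1)                    # (B, hidden_dim)
        logits = self.gate(context)                          # (B, num_experts)
        logits = logits.masked_fill(~active, float("-inf"))  # drop unselected experts
        weights = logits.softmax(dim=-1)                     # assumes >= 1 active expert
        fused = sum(w[:, None, None] * f
                    for w, f in zip(weights.unbind(dim=-1), projected))
        return base_tokens + fused                           # enriched visual tokens

In this reading, routing happens once per input (the abstract says it also conditions on the input image, which the sketch abstracts into llm_choose), while fusion is applied token-wise to enrich the base encoder's representation.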

Bibliographic Details
Published in: arXiv.org, 2024-10
Main authors: Zong, Zhuofan; Ma, Bingqi; Shen, Dazhong; Song, Guanglu; Shao, Hao; Jiang, Dongzhi; Li, Hongsheng; Liu, Yu
Format: Article
Language: English
Subjects: Adapters; Coders; Context; Large language models; Mixtures; Performance evaluation; Vision
Online access: Full text
EISSN: 2331-8422