MoVA: Adapting Mixture of Vision Experts to Multimodal Context

As the key component of multimodal large language models (MLLMs), the visual encoder largely determines how well an MLLM understands diverse image content. Although large-scale pretrained vision encoders such as those in CLIP and DINOv2 have brought promising performance,...

Bibliographic Details
Published in:	arXiv.org 2024-10
Main Authors:	Zong, Zhuofan; Ma, Bingqi; Shen, Dazhong; Song, Guanglu; Shao, Hao; Jiang, Dongzhi; Li, Hongsheng; Liu, Yu
Format:	Article
Language:	English
Subjects:	Adapters; Coders; Context; Large language models; Mixtures; Performance evaluation; Vision
Online Access:	Full text