Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion

Bibliographic Details

Main Authors: Chen, Zhuokun; Hu, Jinwu; Deng, Zeshuai; Wang, Yufeng; Zhuang, Bohan; Tan, Mingkui
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2412.01289

Description

Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of selected MLLMs within a family. By merging the parameters of language models from these MLLMs, VisionFuse allows a single language model to align with various vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements in multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
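
The description above outlines two training-free operations: concatenating the visual tokens produced by the vision encoders of MLLMs from the same family, and merging the parameters of their language models so that a single LLM can consume tokens from either encoder. The sketch below illustrates those two ideas only; it is not the authors' implementation, and the `mllm_a`/`mllm_b` objects, their `vision_encoder` and `projector` attributes, and the interpolation weight `alpha` are hypothetical stand-ins for whatever interface the actual models expose.

```python
import torch

def merge_language_models(state_dict_a, state_dict_b, alpha=0.5):
    # Uniform weight-space interpolation of two language models that share the
    # same architecture and pretrained base (the "MLLM family" condition).
    return {name: alpha * state_dict_a[name] + (1 - alpha) * state_dict_b[name]
            for name in state_dict_a}

def fused_visual_tokens(image, mllm_a, mllm_b):
    # Each MLLM encodes the image with its own vision encoder and projects the
    # result into the LLM embedding space; the two token sequences are then
    # concatenated to enrich the visual context seen by the merged language model.
    tokens_a = mllm_a.projector(mllm_a.vision_encoder(image))   # (n_a, d)
    tokens_b = mllm_b.projector(mllm_b.vision_encoder(image))   # (n_b, d)
    return torch.cat([tokens_a, tokens_b], dim=0)                # (n_a + n_b, d)
```

In use, this would amount to loading two off-the-shelf MLLMs built on the same pretrained LLM, replacing both language models with the merged weights, and feeding the concatenated token sequence in place of either model's original visual tokens.
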
DOI: 10.48550/arxiv.2412.01289
Published: 2024-12-02
Rights: http://creativecommons.org/licenses/by/4.0
Source: arXiv.org