Enhancing Perception Capabilities of Multimodal LLMs with Training-Free Fusion

Bibliographic Details

Main Authors: Chen, Zhuokun; Hu, Jinwu; Deng, Zeshuai; Wang, Yufeng; Zhuang, Bohan; Tan, Mingkui
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2412.01289

Description

Multimodal LLMs (MLLMs) equip language models with visual capabilities by aligning vision encoders with language models. Existing methods to enhance the visual perception of MLLMs often involve designing more powerful vision encoders, which requires exploring a vast design space and re-aligning each potential encoder with the language model, resulting in prohibitively high training costs. In this paper, we introduce VisionFuse, a novel integration framework that efficiently utilizes multiple vision encoders from off-the-shelf MLLMs to enhance visual perception without requiring additional training. Our approach is motivated by the observation that different MLLMs tend to focus on distinct regions given the same query and image. Moreover, we find that the feature distributions of vision encoders within an MLLM family, a group of MLLMs sharing the same pretrained LLM, are highly aligned. Building on these insights, VisionFuse enriches the visual context by concatenating the tokens generated by the vision encoders of selected MLLMs within a family. By merging the parameters of language models from these MLLMs, VisionFuse allows a single language model to align with various vision encoders, significantly reducing deployment overhead. We conduct comprehensive evaluations across multiple multimodal benchmarks using various MLLM combinations, demonstrating substantial improvements in multimodal tasks. Notably, when integrating MiniGemini-8B and SLIME-8B, VisionFuse achieves an average performance increase of over 4%.
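
The description above outlines two training-free operations: concatenating the visual tokens produced by the vision encoders of MLLMs from the same family, and merging the parameters of their language models so that a single LLM can consume tokens from either encoder. The sketch below illustrates those two ideas only; it is not the authors' implementation, and the `mllm_a`/`mllm_b` objects, their `vision_encoder` and `projector` attributes, and the interpolation weight `alpha` are hypothetical stand-ins for whatever interface the actual models expose.

```python
import torch

def merge_language_models(state_dict_a, state_dict_b, alpha=0.5):
    # Uniform weight-space interpolation of two language models that share the
    # same architecture and pretrained base (the "MLLM family" condition).
    return {name: alpha * state_dict_a[name] + (1 - alpha) * state_dict_b[name]
            for name in state_dict_a}

def fused_visual_tokens(image, mllm_a, mllm_b):
    # Each MLLM encodes the image with its own vision encoder and projects the
    # result into the LLM embedding space; the two token sequences are then
    # concatenated to enrich the visual context seen by the merged language model.
    tokens_a = mllm_a.projector(mllm_a.vision_encoder(image))   # (n_a, d)
    tokens_b = mllm_b.projector(mllm_b.vision_encoder(image))   # (n_b, d)
    return torch.cat([tokens_a, tokens_b], dim=0)                # (n_a + n_b, d)
```

In use, this would amount to loading two off-the-shelf MLLMs built on the same pretrained LLM, replacing both language models with the merged weights, and feeding the concatenated token sequence in place of either model's original visual tokens.
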
DOI: 10.48550/arxiv.2412.01289
Published: 2024-12-02
Rights: http://creativecommons.org/licenses/by/4.0
Source: arXiv.org