EVLM: An Efficient Vision-Language Model for Visual Understanding
In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals, such as videos, the self-attention mechanism of language models incurs significant computational overhead. In addition, single-layer ViT features make it difficult for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. The method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing a Mixture of Experts (MoE) mechanism to enhance model effectiveness. The model achieves competitive scores on public multi-modal benchmarks and performs well on tasks such as image captioning and video captioning.
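The abstract names three building blocks: Flamingo-style cross-attention between text and visual tokens, hierarchical ViT features as the visual prompt, and a Mixture of Experts (MoE) layer. The paper's own implementation is not part of this record; the following is a minimal, hypothetical PyTorch sketch of how those pieces could fit together, with all class names, dimensions, and the top-1 routing scheme chosen purely for illustration.

```python
# Hypothetical sketch (not the authors' code): a Flamingo-style gated
# cross-attention block that consumes features gathered from several ViT
# layers and routes its feed-forward computation through a small MoE.
import torch
import torch.nn as nn


class MoEFeedForward(nn.Module):
    """Top-1 routed Mixture of Experts over simple MLP experts (illustrative)."""

    def __init__(self, dim: int, num_experts: int = 4, hidden: int = 2048):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); send each token to its highest-scoring expert.
        gates = self.router(x).softmax(dim=-1)      # (B, T, num_experts)
        top_gate, top_idx = gates.max(dim=-1)       # (B, T)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                out[mask] = expert(x[mask]) * top_gate[mask].unsqueeze(-1)
        return out


class GatedCrossAttentionBlock(nn.Module):
    """Text tokens attend to visual tokens; tanh gates start at zero (Flamingo-style)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = MoEFeedForward(dim)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        vis = self.norm_kv(visual)
        attn_out, _ = self.cross_attn(self.norm_q(text), vis, vis)
        text = text + torch.tanh(self.attn_gate) * attn_out
        text = text + torch.tanh(self.ffn_gate) * self.ffn(text)
        return text


def hierarchical_visual_prompt(vit_layer_outputs: list) -> torch.Tensor:
    """Concatenate features taken from several ViT layers along the token axis."""
    return torch.cat(vit_layer_outputs, dim=1)


if __name__ == "__main__":
    B, T_txt, T_img, D = 2, 16, 64, 768
    text = torch.randn(B, T_txt, D)
    # Stand-ins for outputs of three different ViT layers.
    vit_feats = [torch.randn(B, T_img, D) for _ in range(3)]
    visual = hierarchical_visual_prompt(vit_feats)   # (B, 3*T_img, D)
    block = GatedCrossAttentionBlock(dim=D)
    print(block(text, visual).shape)                 # torch.Size([2, 16, 768])
```

Because the text only attends to visual tokens through cross-attention, the language model's self-attention sequence length stays fixed at the text length regardless of how many visual tokens the hierarchical prompt contributes, which is the efficiency argument the abstract makes.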
Saved in:
Main authors: | Chen, Kaibing; Shen, Dong; Zhong, Hanwen; Zhong, Huasong; Xia, Kui; Xu, Di; Yuan, Wei; Hu, Yifei; Wen, Bin; Zhang, Tianke; Liu, Changyi; Fan, Dewen; Xiao, Huihui; Wu, Jiahong; Yang, Fan; Li, Size; Zhang, Di |
---|---|
Format: | Article |
Language: | eng |
Keywords: | Computer Science - Computer Vision and Pattern Recognition |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Chen, Kaibing; Shen, Dong; Zhong, Hanwen; Zhong, Huasong; Xia, Kui; Xu, Di; Yuan, Wei; Hu, Yifei; Wen, Bin; Zhang, Tianke; Liu, Changyi; Fan, Dewen; Xiao, Huihui; Wu, Jiahong; Yang, Fan; Li, Size; Zhang, Di |
description | In the field of multi-modal language models, the majority of methods are built on an architecture similar to LLaVA. These models use a single-layer ViT feature as a visual prompt, feeding it directly into the language model alongside textual tokens. However, when dealing with long sequences of visual signals, such as videos, the self-attention mechanism of language models incurs significant computational overhead. In addition, single-layer ViT features make it difficult for large language models to perceive visual signals fully. This paper proposes an efficient multi-modal language model that minimizes computational cost while enabling the model to perceive visual signals as comprehensively as possible. The method primarily includes: (1) employing cross-attention for image-text interaction, similar to Flamingo; (2) utilizing hierarchical ViT features; and (3) introducing a Mixture of Experts (MoE) mechanism to enhance model effectiveness. The model achieves competitive scores on public multi-modal benchmarks and performs well on tasks such as image captioning and video captioning. |
doi_str_mv | 10.48550/arxiv.2407.14177 |
format | Article |
fullrecord | arXiv record cdi_arxiv_primary_2407_14177 (created 2024-07-19, open access, license: http://arxiv.org/licenses/nonexclusive-distrib/1.0); title, authors, subject, and abstract as listed above; full text: https://arxiv.org/abs/2407.14177 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2407.14177 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2407_14177 |
source | arXiv.org |
subjects | Computer Science - Computer Vision and Pattern Recognition |
title | EVLM: An Efficient Vision-Language Model for Visual Understanding |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-01T04%3A26%3A22IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=EVLM:%20An%20Efficient%20Vision-Language%20Model%20for%20Visual%20Understanding&rft.au=Chen,%20Kaibing&rft.date=2024-07-19&rft_id=info:doi/10.48550/arxiv.2407.14177&rft_dat=%3Carxiv_GOX%3E2407_14177%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |