LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compre...

Bibliographic details
Main authors:	Shen, Xiaoqian; Xiong, Yunyang; Zhao, Changsheng; Wu, Lemeng; Chen, Jun; Zhu, Chenchen; Liu, Zechun; Xiao, Fanyi; Varadarajan, Balakrishnan; Bordes, Florian; Liu, Zhuang; Xu, Hu; Kim, Hyunwoo J; Soran, Bilge; Krishnamoorthi, Raghuraman; Elhoseiny, Mohamed; Chandra, Vikas
Format:	Article
Language:	eng
Subjects:	Computer Science - Computer Vision and Pattern Recognition

creator	Shen, Xiaoqian; Xiong, Yunyang; Zhao, Changsheng; Wu, Lemeng; Chen, Jun; Zhu, Chenchen; Liu, Zechun; Xiao, Fanyi; Varadarajan, Balakrishnan; Bordes, Florian; Liu, Zhuang; Xu, Hu; Kim, Hyunwoo J; Soran, Bilge; Krishnamoorthi, Raghuraman; Elhoseiny, Mohamed; Chandra, Vikas
description	Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by the LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within a given context length. Our LongVU consistently surpasses existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a lightweight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.
doi_str_mv	10.48550/arxiv.2410.17434
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2410_17434</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2410_17434</sourcerecordid><originalsourceid>FETCH-arxiv_primary_2410_174343</originalsourceid><addsrcrecordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMgEKGJqbGJtwMgT65Oelh4VaKQQXJJZk5pek5hbkFyXmKDimJBaUZJalKjjn5xYUpRYXZ-bnKaTlFymA1CuEZaak5uv6JOallyampyqE5qWkFhWXJOalZOal8zCwpiXmFKfyQmluBnk31xBnD12w5fEFRZm5iUWV8SBHxIMdYUxYBQBVVj0x</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding</title><source>arXiv.org</source><creator>Shen, Xiaoqian ; Xiong, Yunyang ; Zhao, Changsheng ; Wu, Lemeng ; Chen, Jun ; Zhu, Chenchen ; Liu, Zechun ; Xiao, Fanyi ; Varadarajan, Balakrishnan ; Bordes, Florian ; Liu, Zhuang ; Xu, Hu ; Kim, Hyunwoo J ; Soran, Bilge ; Krishnamoorthi, Raghuraman ; Elhoseiny, Mohamed ; Chandra, Vikas</creator><creatorcontrib>Shen, Xiaoqian ; Xiong, Yunyang ; Zhao, Changsheng ; Wu, Lemeng ; Chen, Jun ; Zhu, Chenchen ; Liu, Zechun ; Xiao, Fanyi ; Varadarajan, Balakrishnan ; Bordes, Florian ; Liu, Zhuang ; Xu, Hu ; Kim, Hyunwoo J ; Soran, Bilge ; Krishnamoorthi, Raghuraman ; Elhoseiny, Mohamed ; Chandra, Vikas</creatorcontrib><description>Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. 
Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within given context length. Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.</description><identifier>DOI: 10.48550/arxiv.2410.17434</identifier><language>eng</language><subject>Computer Science - Computer Vision and Pattern Recognition</subject><creationdate>2024-10</creationdate><rights>http://creativecommons.org/licenses/by/4.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,780,885</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2410.17434$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2410.17434$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Shen, Xiaoqian</creatorcontrib><creatorcontrib>Xiong, Yunyang</creatorcontrib><creatorcontrib>Zhao, Changsheng</creatorcontrib><creatorcontrib>Wu, Lemeng</creatorcontrib><creatorcontrib>Chen, Jun</creatorcontrib><creatorcontrib>Zhu, Chenchen</creatorcontrib><creatorcontrib>Liu, Zechun</creatorcontrib><creatorcontrib>Xiao, Fanyi</creatorcontrib><creatorcontrib>Varadarajan, 
Balakrishnan</creatorcontrib><creatorcontrib>Bordes, Florian</creatorcontrib><creatorcontrib>Liu, Zhuang</creatorcontrib><creatorcontrib>Xu, Hu</creatorcontrib><creatorcontrib>Kim, Hyunwoo J</creatorcontrib><creatorcontrib>Soran, Bilge</creatorcontrib><creatorcontrib>Krishnamoorthi, Raghuraman</creatorcontrib><creatorcontrib>Elhoseiny, Mohamed</creatorcontrib><creatorcontrib>Chandra, Vikas</creatorcontrib><title>LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding</title><description>Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within given context length. Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. 
Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.</description><subject>Computer Science - Computer Vision and Pattern Recognition</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNpjYJA0NNAzsTA1NdBPLKrILNMzMgEKGJqbGJtwMgT65Oelh4VaKQQXJJZk5pek5hbkFyXmKDimJBaUZJalKjjn5xYUpRYXZ-bnKaTlFymA1CuEZaak5uv6JOallyampyqE5qWkFhWXJOalZOal8zCwpiXmFKfyQmluBnk31xBnD12w5fEFRZm5iUWV8SBHxIMdYUxYBQBVVj0x</recordid><startdate>20241022</startdate><enddate>20241022</enddate><creator>Shen, Xiaoqian</creator><creator>Xiong, Yunyang</creator><creator>Zhao, Changsheng</creator><creator>Wu, Lemeng</creator><creator>Chen, Jun</creator><creator>Zhu, Chenchen</creator><creator>Liu, Zechun</creator><creator>Xiao, Fanyi</creator><creator>Varadarajan, Balakrishnan</creator><creator>Bordes, Florian</creator><creator>Liu, Zhuang</creator><creator>Xu, Hu</creator><creator>Kim, Hyunwoo J</creator><creator>Soran, Bilge</creator><creator>Krishnamoorthi, Raghuraman</creator><creator>Elhoseiny, Mohamed</creator><creator>Chandra, Vikas</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20241022</creationdate><title>LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding</title><author>Shen, Xiaoqian ; Xiong, Yunyang ; Zhao, Changsheng ; Wu, Lemeng ; Chen, Jun ; Zhu, Chenchen ; Liu, Zechun ; Xiao, Fanyi ; Varadarajan, Balakrishnan ; Bordes, Florian ; Liu, Zhuang ; Xu, Hu ; Kim, Hyunwoo J ; Soran, Bilge ; Krishnamoorthi, Raghuraman ; Elhoseiny, Mohamed ; Chandra, Vikas</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-arxiv_primary_2410_174343</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Computer Vision and Pattern 
Recognition</topic><toplevel>online_resources</toplevel><creatorcontrib>Shen, Xiaoqian</creatorcontrib><creatorcontrib>Xiong, Yunyang</creatorcontrib><creatorcontrib>Zhao, Changsheng</creatorcontrib><creatorcontrib>Wu, Lemeng</creatorcontrib><creatorcontrib>Chen, Jun</creatorcontrib><creatorcontrib>Zhu, Chenchen</creatorcontrib><creatorcontrib>Liu, Zechun</creatorcontrib><creatorcontrib>Xiao, Fanyi</creatorcontrib><creatorcontrib>Varadarajan, Balakrishnan</creatorcontrib><creatorcontrib>Bordes, Florian</creatorcontrib><creatorcontrib>Liu, Zhuang</creatorcontrib><creatorcontrib>Xu, Hu</creatorcontrib><creatorcontrib>Kim, Hyunwoo J</creatorcontrib><creatorcontrib>Soran, Bilge</creatorcontrib><creatorcontrib>Krishnamoorthi, Raghuraman</creatorcontrib><creatorcontrib>Elhoseiny, Mohamed</creatorcontrib><creatorcontrib>Chandra, Vikas</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Shen, Xiaoqian</au><au>Xiong, Yunyang</au><au>Zhao, Changsheng</au><au>Wu, Lemeng</au><au>Chen, Jun</au><au>Zhu, Chenchen</au><au>Liu, Zechun</au><au>Xiao, Fanyi</au><au>Varadarajan, Balakrishnan</au><au>Bordes, Florian</au><au>Liu, Zhuang</au><au>Xu, Hu</au><au>Kim, Hyunwoo J</au><au>Soran, Bilge</au><au>Krishnamoorthi, Raghuraman</au><au>Elhoseiny, Mohamed</au><au>Chandra, Vikas</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding</atitle><date>2024-10-22</date><risdate>2024</risdate><abstract>Multimodal Large Language Models (MLLMs) have shown promising progress in understanding and analyzing video content. However, processing long videos remains a significant challenge constrained by LLM's context size. 
To address this limitation, we propose LongVU, a spatiotemporal adaptive compression mechanism thats reduces the number of video tokens while preserving visual details of long videos. Our idea is based on leveraging cross-modal query and inter-frame dependencies to adaptively reduce temporal and spatial redundancy in videos. Specifically, we leverage DINOv2 features to remove redundant frames that exhibit high similarity. Then we utilize text-guided cross-modal query for selective frame feature reduction. Further, we perform spatial token reduction across frames based on their temporal dependencies. Our adaptive compression strategy effectively processes a large number of frames with little visual information loss within given context length. Our LongVU consistently surpass existing methods across a variety of video understanding benchmarks, especially on hour-long video understanding tasks such as VideoMME and MLVU. Given a light-weight LLM, our LongVU also scales effectively into a smaller size with state-of-the-art video understanding performance.</abstract><doi>10.48550/arxiv.2410.17434</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2410.17434
language	eng
recordid	cdi_arxiv_primary_2410_17434
source	arXiv.org
subjects	Computer Science - Computer Vision and Pattern Recognition
title	LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-06T10%3A30%3A45IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=LongVU:%20Spatiotemporal%20Adaptive%20Compression%20for%20Long%20Video-Language%20Understanding&rft.au=Shen,%20Xiaoqian&rft.date=2024-10-22&rft_id=info:doi/10.48550/arxiv.2410.17434&rft_dat=%3Carxiv_GOX%3E2410_17434%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true
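
The description field above says that redundant frames with highly similar DINOv2 features are removed before any further token reduction. The following is a minimal sketch of that first step only, not the authors' implementation. It assumes that each frame is already represented by one feature vector (plain lists of floats stand in for DINOv2 features here), that similarity is cosine similarity against the most recently kept frame, and that a fixed threshold of 0.9 decides redundancy. The record states none of these choices, and the names `cosine_similarity` and `drop_redundant_frames` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def drop_redundant_frames(
    features: Sequence[Sequence[float]],
    threshold: float = 0.9,  # assumed value, not taken from the record
) -> List[int]:
    """Return the indices of the frames to keep.

    A frame is dropped when its similarity to the most recently kept
    frame exceeds `threshold` (an assumed comparison scheme).
    """
    kept: List[int] = []
    for index, feature in enumerate(features):
        if not kept or cosine_similarity(features[kept[-1]], feature) <= threshold:
            kept.append(index)
    return kept


if __name__ == "__main__":
    # Toy per-frame features: frames 1 and 3 nearly repeat frames 0 and 2.
    toy_features = [
        [1.0, 0.0],
        [0.99, 0.05],
        [0.0, 1.0],
        [0.02, 1.0],
        [1.0, 0.0],
    ]
    print(drop_redundant_frames(toy_features))  # prints [0, 2, 4]
```

Comparing against the last kept frame rather than the immediately preceding one is a design choice in this sketch: it keeps a slow drift across many near-identical frames from being pruned to nothing or kept in full. The text-guided feature reduction and the spatial token reduction mentioned in the description are not covered here.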