Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding

Long video understanding poses a significant challenge for current Multi-modal Large Language Models (MLLMs). Notably, the MLLMs are constrained by their limited context lengths and the substantial costs while processing long videos. Although several existing methods attempt to reduce visual tokens,...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Shu, Yan, Liu, Zheng, Zhang, Peitian, Qin, Minghao, Zhou, Junjie, Liang, Zhengyang, Huang, Tiejun, Zhao, Bo
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!