CogVLM2: Visual Language Models for Image and Video Understanding

Bibliographic Details
Published in: arXiv.org, 2024-08
Main Authors: Hong, Wenyi; Wang, Weihan; Ding, Ming; Yu, Wenmeng; Lv, Qingsong; Wang, Yan; Cheng, Yean; Huang, Shiyu; Ji, Junhui; Zhao, Xue; Zhao, Lei; Yang, Zhuoyi; Gu, Xiaotao; Zhang, Xiaohan; Feng, Guanyu; Yin, Da; Wang, Zihan; Qi, Ji; Song, Xixuan; Zhang, Peng; Liu, Debing; Xu, Bin; Li, Juanzi; Dong, Yuxiao; Tang, Jie
Format: Article
Language: English
Online Access: Full text
Description
Summary: Beginning with VisualGLM and CogVLM, we have been continuously exploring VLMs in pursuit of enhanced vision-language fusion, an efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding, including CogVLM2, CogVLM2-Video, and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both the pre-training and post-training stages, supporting input resolutions up to \(1344 \times 1344\) pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated construction of temporal grounding data. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks such as MMBench, MM-Vet, TextVQA, MVBench, and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
ISSN: 2331-8422
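
As a hedged illustration of the timestamped multi-frame input that the summary attributes to CogVLM2-Video, the Python sketch below pairs sampled video frames with their timestamps before assembling a prompt. Every name in it (TimedFrame, format_video_prompt, the prompt layout) is a hypothetical assumption, not the official CogVLM2-Video API; the actual implementation lives in the repositories linked above.

    # Minimal sketch of pairing sampled frames with timestamps so a model can
    # ground answers in time. All names and the prompt layout are hypothetical,
    # not the official CogVLM2-Video API (see https://github.com/THUDM/CogVLM2).
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TimedFrame:
        timestamp_s: float  # position of the sampled frame in the video, in seconds
        image_path: str     # the frame stored as an image file

    def format_video_prompt(frames: List[TimedFrame], question: str) -> str:
        """Interleave frame placeholders with their timestamps (assumed layout,
        for illustration only) and append the user's question."""
        parts = [f"[{f.timestamp_s:.1f} s] <frame: {f.image_path}>" for f in frames]
        parts.append(f"Question: {question}")
        return "\n".join(parts)

    if __name__ == "__main__":
        # Four frames sampled uniformly at 2.5-second intervals.
        frames = [TimedFrame(2.5 * i, f"frame_{i}.jpg") for i in range(4)]
        print(format_video_prompt(frames, "When does the car enter the scene?"))

Attaching an explicit timestamp to each frame, rather than feeding an unordered set of images, is what lets a model answer "when" questions; the sketch only shows that input-formatting idea, not the model itself.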