CogVLM2: Visual Language Models for Image and Video Understanding

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications. Here we propose the CogVLM2 family, a new generation of visual language models for image and video understanding including CogVLM2, CogVLM2-Video and GLM-4V. As an image understanding model, CogVLM2 inherits the visual expert architecture with improved training recipes in both pre-training and post-training stages, supporting input resolution up to \(1344 \times 1344\) pixels. As a video understanding model, CogVLM2-Video integrates multi-frame input with timestamps and proposes automated temporal grounding data construction. Notably, the CogVLM2 family has achieved state-of-the-art results on benchmarks like MMBench, MM-Vet, TextVQA, MVBench and VCGBench. All models are open-sourced at https://github.com/THUDM/CogVLM2 and https://github.com/THUDM/GLM-4, contributing to the advancement of the field.
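
To make the abstract's notion of "multi-frame input with timestamps" concrete, the sketch below samples frames evenly from a video and pairs each frame with its timestamp before assembling a prompt string. This is an illustration only, not the authors' released code: the function name, frame count, and prompt layout are assumptions, and it uses OpenCV (opencv-python) rather than any CogVLM2-specific API.

```python
import cv2

def sample_frames_with_timestamps(video_path: str, num_frames: int = 8):
    """Return (timestamp_seconds, frame) pairs sampled evenly across the video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0            # cap.get returns 0.0 when FPS is unknown
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    samples = []
    for i in range(num_frames):
        idx = min(int(i * total / num_frames), max(total - 1, 0))
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)           # seek to the chosen frame index
        ok, frame = cap.read()
        if not ok:
            break
        samples.append((idx / fps, frame))              # timestamp in seconds, BGR image array
    cap.release()
    return samples

if __name__ == "__main__":
    # Interleave timestamps with frame placeholders; a timestamp-aware video VLM
    # would receive the sampled frames alongside such text.
    pairs = sample_frames_with_timestamps("example.mp4", num_frames=8)
    prompt = "".join(f"[{t:.1f}s] <frame {k}>\n" for k, (t, _) in enumerate(pairs))
    print(prompt)
```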

Bibliographic details
Published in: arXiv.org, 2024-08
Main authors: Hong, Wenyi, Wang, Weihan, Ding, Ming, Yu, Wenmeng, Lv, Qingsong, Wang, Yan, Cheng, Yean, Huang, Shiyu, Ji, Junhui, Zhao, Xue, Zhao, Lei, Yang, Zhuoyi, Gu, Xiaotao, Zhang, Xiaohan, Feng, Guanyu, Yin, Da, Wang, Zihan, Ji Qi, Song, Xixuan, Zhang, Peng, Liu, Debing, Xu, Bin, Li, Juanzi, Dong, Yuxiao, Tang, Jie
Format: Article
Language: English
EISSN: 2331-8422
Publisher: Cornell University Library, arXiv.org (Ithaca)
Subjects: Enhanced vision; Image enhancement
Online access: Full text