Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data


Detailed Description

Saved in:
Bibliographic Details
Published in: Cognitive computation 2024-11, Vol.16 (6), p.3484-3504
Main Authors: Xie, Peijin; Liu, Bingquan
Format: Article
Language: English
Subjects:
Online Access: Full text
description Does the model demonstrate exceptional proficiency in “item counting,” “color recognition,” or other Fundamental Visual Comprehension Capability (FVCC)? There have been remarkable advancements in the multimodal field: pretrained general Vision Language Models (VLMs) exhibit strong performance across a range of intricate Visual Language (VL) tasks, and Multimodal Large Language Models (MLLMs) show novel visual reasoning abilities from only a few examples. However, models tend to encounter difficulties when confronted with texts that supplement simple visual phrases with specific details. Moreover, there is a scarcity of datasets of sufficient quantity, variety, and composability to enable the evaluation of each FVCC with statistical metrics. Accordingly, we decomposed the complete VL task into 9M simple Visual Phrase Triplets (VPTs) across 16 categories representing 16 distinct FVCCs drawn from the structural scene graph. We then reconstructed a Multilevel Scene Graph (MLSG) for each image and introduced our unbiased, balanced, and binary Visual Phrase Entailment benchmark with 20 times the data volume of SNLI-VE. The benchmark consists of three exams and evaluates the performance of 8 widely used VLMs and 10 MLLMs, respectively. The results demonstrate the performance of each model across the 16 FVCC classes, as well as their lower and upper limits under conditions of increased text complexity or unnoised image input. Finally, we enhanced the efficiency of MLLMs and evoked their In-Context Learning characteristics by appending multiple VPT-generated QA pairs of identical types to the conversation history without tuning. The proposed structural VPTs and MLSG data hold promise for facilitating future explorations of FVCC.
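As a rough illustration of the decomposition the abstract describes, the sketch below is hypothetical code (not the authors' released pipeline): the toy scene graph, its field names, and the distractor list are invented for illustration. It extracts simple (subject, predicate, object) triplets from a scene graph and pairs each true phrase with a corrupted counterpart, yielding a balanced binary entailment set in the spirit of the proposed benchmark.

```python
# Hypothetical sketch: scene graph -> Visual Phrase Triplets (VPTs)
# -> balanced binary entailment pairs. Field names are assumptions.
import random

scene_graph = {
    "objects": [
        {"id": 0, "name": "dog", "attributes": ["brown"]},
        {"id": 1, "name": "frisbee", "attributes": ["red"]},
    ],
    "relations": [{"subj": 0, "pred": "catching", "obj": 1}],
}

def extract_vpts(graph):
    """Decompose a scene graph into simple (subject, predicate, object) triplets."""
    names = {o["id"]: o["name"] for o in graph["objects"]}
    vpts = []
    # Attribute triplets, e.g. (dog, is, brown) -> color/state-style FVCCs.
    for obj in graph["objects"]:
        for attr in obj["attributes"]:
            vpts.append((obj["name"], "is", attr))
    # Relation triplets, e.g. (dog, catching, frisbee) -> interaction FVCCs.
    for rel in graph["relations"]:
        vpts.append((names[rel["subj"]], rel["pred"], names[rel["obj"]]))
    return vpts

def make_entailment_pairs(vpts, distractors=("cat", "ball"), seed=0):
    """Balance the benchmark: each true phrase (label 1) gets one false
    counterpart (label 0) made by swapping the subject for a distractor."""
    rng = random.Random(seed)
    pairs = []
    for subj, pred, obj in vpts:
        pairs.append((f"{subj} {pred} {obj}", 1))  # entailed by the image
        fake = rng.choice([d for d in distractors if d != subj])
        pairs.append((f"{fake} {pred} {obj}", 0))  # not entailed
    return pairs

vpts = extract_vpts(scene_graph)
pairs = make_entailment_pairs(vpts)
```

Because every positive phrase has exactly one negative twin of the same category, per-category accuracy can be read off directly as a statistical measure of one FVCC, which is the property the abstract's "unbiased, balanced, and binary" construction aims for.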
doi_str_mv 10.1007/s12559-024-10351-8
format Article
publisher New York: Springer US
rights The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
fulltext fulltext
identifier ISSN: 1866-9956
ispartof Cognitive computation, 2024-11, Vol.16 (6), p.3484-3504
issn 1866-9956
1866-9964
language eng
recordid cdi_proquest_journals_3125874941
source SpringerNature Journals
subjects Accuracy
Artificial Intelligence
Benchmarks
Cognition & reasoning
Computation by Abstract Devices
Computational Biology/Bioinformatics
Computer Science
Datasets
Graphical representations
Image enhancement
Image reconstruction
Language
Large language models
Linguistics
Performance evaluation
Semantics
Vision
Visual fields
Visual tasks
title Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T06%3A34%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Probing%20Fundamental%20Visual%20Comprehend%20Capabilities%20on%20Vision%20Language%20Models%20via%20Visual%20Phrases%20from%20Structural%20Data&rft.jtitle=Cognitive%20computation&rft.au=Xie,%20Peijin&rft.date=2024-11-01&rft.volume=16&rft.issue=6&rft.spage=3484&rft.epage=3504&rft.pages=3484-3504&rft.issn=1866-9956&rft.eissn=1866-9964&rft_id=info:doi/10.1007/s12559-024-10351-8&rft_dat=%3Cproquest_cross%3E3125874941%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3125874941&rft_id=info:pmid/&rfr_iscdi=true