Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data


Detailed Description

Saved in:
Bibliographic Details
Published in: Cognitive computation 2024-11, Vol.16 (6), p.3484-3504
Main Authors: Xie, Peijin; Liu, Bingquan
Format: Article
Language: English
Subjects:
Online Access: Full text
description Does the model demonstrate exceptional proficiency in “item counting,” “color recognition,” or other Fundamental Visual Comprehension Capability (FVCC)? There have been remarkable advancements in the multimodal field: pretrained general Vision Language Models (VLMs) exhibit strong performance across a range of intricate Visual Language (VL) tasks, and Multimodal Large Language Models (MLLMs) show novel visual reasoning abilities from only a few examples. However, models tend to encounter difficulties when confronted with texts that supplement simple visual phrases with specific details. Moreover, there is a scarcity of datasets of sufficient quantity, variety, and composability to enable the evaluation of each FVCC with statistical metrics. Accordingly, we decomposed the complete VL task into 9M simple Visual Phrase Triplets (VPTs) across 16 categories representing 16 distinct FVCCs drawn from the structural scene graph. We then reconstructed a Multilevel Scene Graph (MLSG) for each image and introduced our unbiased, balanced, and binary Visual Phrase Entailment benchmark with 20 times the data volume of SNLI-VE. The benchmark consists of three exams and evaluates the performance of 8 widely used VLMs and 10 MLLMs, respectively. The results demonstrate the performance of each model across the 16 FVCC classes, as well as their lower and upper limits under conditions of increased text complexity or unnoised image input. Finally, we enhanced the efficiency of MLLMs and evoked their In-Context Learning characteristics by appending multiple VPT-generated QA pairs of identical types to the conversation history without tuning. The proposed structural VPTs and MLSG data hold promise for facilitating future explorations of FVCC.
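As a rough illustration of the decomposition the abstract describes, the sketch below is hypothetical code (not the authors' released pipeline): the toy scene graph, its field names, and the distractor list are invented for illustration. It extracts simple (subject, predicate, object) triplets from a scene graph and pairs each true phrase with a corrupted counterpart, yielding a balanced binary entailment set in the spirit of the proposed benchmark.

```python
# Hypothetical sketch: scene graph -> Visual Phrase Triplets (VPTs)
# -> balanced binary entailment pairs. Field names are assumptions.
import random

scene_graph = {
    "objects": [
        {"id": 0, "name": "dog", "attributes": ["brown"]},
        {"id": 1, "name": "frisbee", "attributes": ["red"]},
    ],
    "relations": [{"subj": 0, "pred": "catching", "obj": 1}],
}

def extract_vpts(graph):
    """Decompose a scene graph into simple (subject, predicate, object) triplets."""
    names = {o["id"]: o["name"] for o in graph["objects"]}
    vpts = []
    # Attribute triplets, e.g. (dog, is, brown) -> color/state-style FVCCs.
    for obj in graph["objects"]:
        for attr in obj["attributes"]:
            vpts.append((obj["name"], "is", attr))
    # Relation triplets, e.g. (dog, catching, frisbee) -> interaction FVCCs.
    for rel in graph["relations"]:
        vpts.append((names[rel["subj"]], rel["pred"], names[rel["obj"]]))
    return vpts

def make_entailment_pairs(vpts, distractors=("cat", "ball"), seed=0):
    """Balance the benchmark: each true phrase (label 1) gets one false
    counterpart (label 0) made by swapping the subject for a distractor."""
    rng = random.Random(seed)
    pairs = []
    for subj, pred, obj in vpts:
        pairs.append((f"{subj} {pred} {obj}", 1))  # entailed by the image
        fake = rng.choice([d for d in distractors if d != subj])
        pairs.append((f"{fake} {pred} {obj}", 0))  # not entailed
    return pairs

vpts = extract_vpts(scene_graph)
pairs = make_entailment_pairs(vpts)
```

Because every positive phrase has exactly one negative twin of the same category, per-category accuracy can be read off directly as a statistical measure of one FVCC, which is the property the abstract's "unbiased, balanced, and binary" construction aims for.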
doi_str_mv 10.1007/s12559-024-10351-8
format Article
publisher New York: Springer US
rights The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
fulltext fulltext
identifier ISSN: 1866-9956
ispartof Cognitive computation, 2024-11, Vol.16 (6), p.3484-3504
issn 1866-9956
1866-9964
language eng
recordid cdi_proquest_journals_3125874941
source SpringerNature Journals
subjects Accuracy
Artificial Intelligence
Benchmarks
Cognition & reasoning
Computation by Abstract Devices
Computational Biology/Bioinformatics
Computer Science
Datasets
Graphical representations
Image enhancement
Image reconstruction
Language
Large language models
Linguistics
Performance evaluation
Semantics
Vision
Visual fields
Visual tasks
title Probing Fundamental Visual Comprehend Capabilities on Vision Language Models via Visual Phrases from Structural Data
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-23T06%3A34%3A18IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Probing%20Fundamental%20Visual%20Comprehend%20Capabilities%20on%20Vision%20Language%20Models%20via%20Visual%20Phrases%20from%20Structural%20Data&rft.jtitle=Cognitive%20computation&rft.au=Xie,%20Peijin&rft.date=2024-11-01&rft.volume=16&rft.issue=6&rft.spage=3484&rft.epage=3504&rft.pages=3484-3504&rft.issn=1866-9956&rft.eissn=1866-9964&rft_id=info:doi/10.1007/s12559-024-10351-8&rft_dat=%3Cproquest_cross%3E3125874941%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3125874941&rft_id=info:pmid/&rfr_iscdi=true