TopViewRS: Vision-Language Models as Top-View Spatial Reasoners

Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and navigation of humans as well as of `non-human' agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless, spatial reasoning capabilities of modern VLMs remain unattested and underexplored. In this work, we thus study their capability to understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different granularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either realistic or semantic top-view maps as visual input. We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with different levels of complexity. Evaluation of 10 representative open- and closed-source VLMs reveals a gap of more than 50% compared to average human performance, and in some cases performance is even lower than the random baseline. Although additional experiments show that Chain-of-Thought reasoning can boost model capabilities by 5.82% on average, the overall performance of VLMs remains limited. Our findings underscore the critical need for enhanced model capability in top-view spatial reasoning and set a foundation for further research towards human-level proficiency of VLMs in real-world multimodal tasks.

Bibliographic Details
Published in: arXiv.org, 2024-06
Main authors: Li, Chengzu; Zhang, Caiqi; Zhou, Han; Collier, Nigel; Korhonen, Anna; Vulić, Ivan
Format: Article
Language: English
Subjects: Cognition & reasoning; Human performance; Object recognition; Reasoning; Task complexity; Vision
Online access: Full text
EISSN: 2331-8422