GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games

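With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.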

Bibliographic details
Main authors:	Mei, Aoran; Wang, Jianhua; Zhu, Guo-Niu; Gan, Zhongxue
Format:	Article
Language:	eng
Subjects:	Computer Science - Artificial Intelligence; Computer Science - Robotics

creator	Mei, Aoran; Wang, Jianhua; Zhu, Guo-Niu; Gan, Zhongxue
description	With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.
doi_str_mv	10.48550/arxiv.2405.13751
format	Article
fullrecord	<record><control><sourceid>arxiv_GOX</sourceid><recordid>TN_cdi_arxiv_primary_2405_13751</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2405_13751</sourcerecordid><originalsourceid>FETCH-LOGICAL-a671-16840c66b18daf53c871cae46f31d507c02bfa76ed27c032cfee1fdce344b6ed3</originalsourceid><addsrcrecordid>eNotj81OhDAUhdm4MKMP4Mr7AmBLoRB34-iMJkw0hszCDbn0hzRAq6348_YDo6tzcs7Jvfmi6IqSJCvznNyg_zFfSZqRPKGsyOl59LHDUR2q_S2s4V4JE4yz8Yi9sR1s_dx9O9-Ddh5eXes-jYAaQw8vA1q7bO4wKAnOwsGECQeo0HYTdgr2TqohAFoJb8q7OEwjLL_CRXSmcQjq8l9XUb19qDePcfW8e9qsqxh5QWPKy4wIzltaStQ5E2VBBaqMa0ZlTgpB0lZjwZVMZ89SoZWiWgrFsqydU7aKrv_Onpibd29G9L_Nwt6c2NkRF95U8w</addsrcrecordid><sourcetype>Open Access Repository</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype></control><display><type>article</type><title>GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games</title><source>arXiv.org</source><creator>Mei, Aoran ; Wang, Jianhua ; Zhu, Guo-Niu ; Gan, Zhongxue</creator><creatorcontrib>Mei, Aoran ; Wang, Jianhua ; Zhu, Guo-Niu ; Gan, Zhongxue</creatorcontrib><description>With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. 
Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.</description><identifier>DOI: 10.48550/arxiv.2405.13751</identifier><language>eng</language><subject>Computer Science - Artificial Intelligence ; Computer Science - Robotics</subject><creationdate>2024-05</creationdate><rights>http://arxiv.org/licenses/nonexclusive-distrib/1.0</rights><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><link.rule.ids>228,230,776,881</link.rule.ids><linktorsrc>$$Uhttps://arxiv.org/abs/2405.13751$$EView_record_in_Cornell_University$$FView_record_in_$$GCornell_University$$Hfree_for_read</linktorsrc><backlink>$$Uhttps://doi.org/10.48550/arXiv.2405.13751$$DView paper in arXiv$$Hfree_for_read</backlink></links><search><creatorcontrib>Mei, Aoran</creatorcontrib><creatorcontrib>Wang, Jianhua</creatorcontrib><creatorcontrib>Zhu, Guo-Niu</creatorcontrib><creatorcontrib>Gan, Zhongxue</creatorcontrib><title>GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games</title><description>With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. 
Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.</description><subject>Computer Science - Artificial Intelligence</subject><subject>Computer Science - Robotics</subject><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>GOX</sourceid><recordid>eNotj81OhDAUhdm4MKMP4Mr7AmBLoRB34-iMJkw0hszCDbn0hzRAq6348_YDo6tzcs7Jvfmi6IqSJCvznNyg_zFfSZqRPKGsyOl59LHDUR2q_S2s4V4JE4yz8Yi9sR1s_dx9O9-Ddh5eXes-jYAaQw8vA1q7bO4wKAnOwsGECQeo0HYTdgr2TqohAFoJb8q7OEwjLL_CRXSmcQjq8l9XUb19qDePcfW8e9qsqxh5QWPKy4wIzltaStQ5E2VBBaqMa0ZlTgpB0lZjwZVMZ89SoZWiWgrFsqydU7aKrv_Onpibd29G9L_Nwt6c2NkRF95U8w</recordid><startdate>20240522</startdate><enddate>20240522</enddate><creator>Mei, Aoran</creator><creator>Wang, Jianhua</creator><creator>Zhu, Guo-Niu</creator><creator>Gan, Zhongxue</creator><scope>AKY</scope><scope>GOX</scope></search><sort><creationdate>20240522</creationdate><title>GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games</title><author>Mei, Aoran ; Wang, Jianhua ; Zhu, Guo-Niu ; Gan, 
Zhongxue</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-a671-16840c66b18daf53c871cae46f31d507c02bfa76ed27c032cfee1fdce344b6ed3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Computer Science - Artificial Intelligence</topic><topic>Computer Science - Robotics</topic><toplevel>online_resources</toplevel><creatorcontrib>Mei, Aoran</creatorcontrib><creatorcontrib>Wang, Jianhua</creatorcontrib><creatorcontrib>Zhu, Guo-Niu</creatorcontrib><creatorcontrib>Gan, Zhongxue</creatorcontrib><collection>arXiv Computer Science</collection><collection>arXiv.org</collection></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Mei, Aoran</au><au>Wang, Jianhua</au><au>Zhu, Guo-Niu</au><au>Gan, Zhongxue</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games</atitle><date>2024-05-22</date><risdate>2024</risdate><abstract>With their prominent scene understanding and reasoning capabilities, pre-trained visual-language models (VLMs) such as GPT-4V have attracted increasing attention in robotic task planning. Compared with traditional task planning strategies, VLMs are strong in multimodal information parsing and code generation and show remarkable efficiency. Although VLMs demonstrate great potential in robotic task planning, they suffer from challenges like hallucination, semantic complexity, and limited context. To handle such issues, this paper proposes a multi-agent framework, i.e., GameVLM, to enhance the decision-making process in robotic task planning. In this study, VLM-based decision and expert agents are presented to conduct the task planning. 
Specifically, decision agents are used to plan the task, and the expert agent is employed to evaluate these task plans. Zero-sum game theory is introduced to resolve inconsistencies among different agents and determine the optimal solution. Experimental results on real robots demonstrate the efficacy of the proposed framework, with an average success rate of 83.3%.</abstract><doi>10.48550/arxiv.2405.13751</doi><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	DOI: 10.48550/arxiv.2405.13751
language	eng
recordid	cdi_arxiv_primary_2405_13751
source	arXiv.org
subjects	Computer Science - Artificial Intelligence; Computer Science - Robotics
title	GameVLM: A Decision-making Framework for Robotic Task Planning Based on Visual Language Models and Zero-sum Games
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T08%3A41%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=GameVLM:%20A%20Decision-making%20Framework%20for%20Robotic%20Task%20Planning%20Based%20on%20Visual%20Language%20Models%20and%20Zero-sum%20Games&rft.au=Mei,%20Aoran&rft.date=2024-05-22&rft_id=info:doi/10.48550/arxiv.2405.13751&rft_dat=%3Carxiv_GOX%3E2405_13751%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true