Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning


Detailed Description

Bibliographic Details
Main Authors: Hu, Yingdong, Lin, Fanqi, Zhang, Tong, Yi, Li, Gao, Yang
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Hu, Yingdong
Lin, Fanqi
Zhang, Tong
Yi, Li
Gao, Yang
description In this study, we are interested in imbuing robots with the capability of physically-grounded task planning. Recent advancements have shown that large language models (LLMs) possess extensive knowledge useful in robotic tasks, especially in reasoning and planning. However, LLMs are constrained by their lack of world grounding and dependence on external affordance models to perceive environmental information, which cannot jointly reason with LLMs. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa directly integrates perceptual data into its reasoning and planning process, enabling a profound understanding of commonsense knowledge in the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners, highlighting its effectiveness in a wide array of open-world manipulation tasks.
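The description above outlines the core ViLa loop: a vision-language model such as GPT-4V receives the current camera image together with the task goal and the steps executed so far, proposes the next actionable step, and is re-queried after each executed step so that visual feedback stays in the planning loop. The following Python sketch illustrates only that closed-loop structure; query_vlm, capture_image, and execute_skill are hypothetical placeholders, not the authors' implementation or any specific provider API.

# Minimal sketch of a ViLa-style closed-loop planner. The three helper
# functions are hypothetical placeholders standing in for a VLM client,
# a robot camera, and a low-level skill executor.

from typing import List

PROMPT_TEMPLATE = (
    "You are a robot task planner. Goal: {goal}\n"
    "Steps already executed: {history}\n"
    "Given the attached image of the current scene, output the single next "
    "actionable step, or DONE if the goal is already achieved."
)

def query_vlm(image_bytes: bytes, prompt: str) -> str:
    """Placeholder: send the image and prompt to a VLM (e.g. GPT-4V) and
    return its text reply. The concrete API call depends on the provider."""
    raise NotImplementedError

def capture_image() -> bytes:
    """Placeholder: grab the current RGB observation from the robot camera."""
    raise NotImplementedError

def execute_skill(step: str) -> None:
    """Placeholder: map the language step to a low-level skill and run it."""
    raise NotImplementedError

def vila_style_plan(goal: str, max_steps: int = 20) -> List[str]:
    """Closed-loop planning: re-query the VLM after every executed step so
    the plan is grounded in fresh visual feedback rather than fixed upfront."""
    history: List[str] = []
    for _ in range(max_steps):
        image = capture_image()
        prompt = PROMPT_TEMPLATE.format(goal=goal, history=history or "none")
        step = query_vlm(image, prompt).strip()
        if step.upper().startswith("DONE"):
            break
        execute_skill(step)
        history.append(step)
    return history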
doi_str_mv 10.48550/arxiv.2311.17842
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2311.17842
language eng
recordid cdi_arxiv_primary_2311_17842
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Learning
Computer Science - Robotics
title Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-05T06%3A06%3A25IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Look%20Before%20You%20Leap:%20Unveiling%20the%20Power%20of%20GPT-4V%20in%20Robotic%20Vision-Language%20Planning&rft.au=Hu,%20Yingdong&rft.date=2023-11-29&rft_id=info:doi/10.48550/arxiv.2311.17842&rft_dat=%3Carxiv_GOX%3E2311_17842%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true