Clio: Real-time Task-Driven Open-Set 3D Scene Graphs


Detailed description

Saved in:
Bibliographic details
Published in: arXiv.org 2024-09
Main authors: Maggio, Dominic, Chang, Yun, Hughes, Nathan, Trang, Matthew, Griffith, Dan, Dougherty, Carlyn, Cristofalo, Eric, Schmid, Lukas, Carlone, Luca
Format: Article
Language: eng
Subjects:
Online access: Full text
container_title arXiv.org
creator Maggio, Dominic
Chang, Yun
Hughes, Nathan
Trang, Matthew
Griffith, Dan
Dougherty, Carlyn
Cristofalo, Eric
Schmid, Lukas
Carlone, Luca
description Modern tools for class-agnostic image segmentation (e.g., SegmentAnything) and open-set semantic understanding (e.g., CLIP) provide unprecedented opportunities for robot perception and mapping. While traditional closed-set metric-semantic maps were restricted to tens or hundreds of semantic classes, we can now build maps with a plethora of objects and countless semantic variations. This leaves us with a fundamental question: what is the right granularity for the objects (and, more generally, for the semantic concepts) the robot has to include in its map representation? While related work implicitly chooses a level of granularity by tuning thresholds for object detection, we argue that such a choice is intrinsically task-dependent. The first contribution of this paper is to propose a task-driven 3D scene understanding problem, where the robot is given a list of tasks in natural language and has to select the granularity and the subset of objects and scene structure to retain in its map that is sufficient to complete the tasks. We show that this problem can be naturally formulated using the Information Bottleneck (IB), an established information-theoretic framework. The second contribution is an algorithm for task-driven 3D scene understanding based on an Agglomerative IB approach that clusters 3D primitives in the environment into task-relevant objects and regions and executes incrementally. The third contribution is to integrate our task-driven clustering algorithm into a real-time pipeline, named Clio, that constructs a hierarchical 3D scene graph of the environment online using only onboard compute, as the robot explores it. Our final contribution is an extensive experimental campaign showing that Clio not only allows real-time construction of compact open-set 3D scene graphs, but also improves the accuracy of task execution by limiting the map to relevant semantic concepts.
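The abstract describes the clustering step only at a high level: the IB framework trades off compression of the primitives against retained task-relevant information, and the agglomerative variant repeatedly merges the pair of clusters whose merge discards the least task information. Below is a minimal, illustrative Python sketch of Agglomerative Information Bottleneck clustering under stated assumptions: each primitive carries a prior p(x) (e.g., normalized segment size) and a task-relevance distribution p(y|x) (e.g., a softmax over CLIP similarities to the task prompts). The greedy pairwise search, the stopping threshold, and all names are hypothetical and do not reproduce the authors' implementation.

# Illustrative sketch of Agglomerative Information Bottleneck clustering.
# Assumptions (not from the paper): p(y|x) from CLIP-task similarities,
# batch greedy merging, and a fixed information-loss stopping threshold.
import numpy as np

def kl(p, q, eps=1e-12):
    # KL divergence D(p || q) for discrete distributions, clipped for stability.
    p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def js(p, q, w1, w2):
    # Weighted Jensen-Shannon divergence with mixture weights (w1, w2).
    m = w1 * p + w2 * q
    return w1 * kl(p, m) + w2 * kl(q, m)

def agglomerative_ib(p_x, p_y_given_x, max_info_loss):
    # p_x:           (N,) prior over 3D primitives
    # p_y_given_x:   (N, T) task-relevance distribution per primitive
    # max_info_loss: stop once the cheapest merge loses more task info than this
    clusters = [[i] for i in range(len(p_x))]
    weights = [float(w) for w in p_x]                       # p(c) per cluster
    dists = [p_y_given_x[i].copy() for i in range(len(p_x))]  # p(y|c) per cluster

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                w = weights[i] + weights[j]
                # AIB merge cost: task information lost by merging i and j.
                cost = w * js(dists[i], dists[j], weights[i] / w, weights[j] / w)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        if cost > max_info_loss:
            break  # further merges would discard too much task-relevant information
        w = weights[i] + weights[j]
        merged = (weights[i] * dists[i] + weights[j] * dists[j]) / w
        clusters[i] += clusters[j]
        weights[i], dists[i] = w, merged
        del clusters[j], weights[j], dists[j]
    return clusters

In the full system described above, such clustering would run incrementally over 3D primitives as the scene graph is built online, rather than as the single batch pass shown in this sketch.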
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-09
issn 2331-8422
language eng
recordid cdi_proquest_journals_3044856247
source Free E-Journals
subjects Algorithms
Clustering
Graphs
Image segmentation
Information theory
Object recognition
Real time
Robots
Scene analysis
Semantics
title Clio: Real-time Task-Driven Open-Set 3D Scene Graphs
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T03%3A49%3A07IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=Clio:%20Real-time%20Task-Driven%20Open-Set%203D%20Scene%20Graphs&rft.jtitle=arXiv.org&rft.au=Maggio,%20Dominic&rft.date=2024-09-26&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3044856247%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3044856247&rft_id=info:pmid/&rfr_iscdi=true