Weakly Supervised 3D Open-vocabulary Segmentation

Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature as the 2D models are mostly finetuned with close-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at \url{https://github.com/Kunhao-Liu/3D-OVS}.
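The assignment step the abstract describes (matching distilled per-pixel features against CLIP text embeddings of the class descriptions) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: `open_vocab_segment` is a hypothetical helper, and random arrays stand in for real rendered NeRF features and real CLIP embeddings.

```python
import numpy as np

def open_vocab_segment(feature_map, text_embeddings):
    """Assign each pixel the class whose text embedding is most similar
    (by cosine similarity) to the pixel's distilled feature vector.

    feature_map:     (H, W, D) per-pixel features, e.g. rendered from a
                     NeRF whose feature field was distilled from CLIP.
    text_embeddings: (C, D) one CLIP-style embedding per class prompt.
    Returns an (H, W) array of class indices.
    """
    # L2-normalize both sides so dot products become cosine similarities.
    f = feature_map / np.linalg.norm(feature_map, axis=-1, keepdims=True)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
    sims = f @ t.T          # (H, W, C): similarity of every pixel to every class
    return sims.argmax(-1)  # (H, W): hard segmentation map

# Toy usage with random stand-ins for real features and embeddings.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 4, 512))  # pretend rendered feature map
texts = rng.normal(size=(3, 512))        # 3 open-vocabulary class prompts
seg = open_vocab_segment(features, texts)
print(seg.shape)  # (4, 4)
```

Because the class set enters only through `text_embeddings`, new categories can be queried at test time by embedding new text prompts, with no retraining of the segmentation head.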

Bibliographic details

Published in: arXiv.org, 2023-09
Main authors: Liu, Kunhao; Zhan, Fangneng; Zhang, Jiahui; Xu, Muyu; Yu, Yingchen; Abdulmotaleb El Saddik; Theobalt, Christian; Xing, Eric; Lu, Shijian
Format: Article
Language: English
Online access: Full text
EISSN: 2331-8422
Publisher: Ithaca: Cornell University Library, arXiv.org
Subjects: Alignment; Annotations; Computer vision; Datasets; Distillation; Image segmentation; Training; Two dimensional models