OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding

The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods th...

Bibliographic details
Published in:	IEEE transactions on circuits and systems for video technology, 2024-12, Vol.34 (12), p.12923-12936
Main authors:	Liao, Guibiao; Zhou, Kaichen; Bao, Zhenyu; Liu, Kanglin; Li, Qing
Format:	Article
Language:	eng
Subjects:	Circuits and systems; cross-view self-enhancement; Learning; Neural radiance field; open-vocabulary; Radiance; Regularization; Rendering (computer graphics); Semantics; Solid modeling; Three-dimensional displays; Training; vision and language foundation models
Online access:	Order full text
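
The abstract of this article reports its gains in mIoU (mean intersection-over-union) on Replica and ScanNet. As a reading aid, the following is a minimal sketch of how mIoU is commonly computed from per-pixel class labels. The function name `mean_iou`, the toy labels, and the convention of skipping classes that appear in neither prediction nor ground truth are assumptions made for illustration; this record does not describe the paper's exact evaluation protocol.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class IoU over classes present in pred or target.

    Hypothetical helper for illustration; not taken from the OV-NeRF code base.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    # Accumulate per-class counts in a single pass over the labels.
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: ignore classes absent from both inputs
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In the toy example the per-class IoUs are 1/3, 2/3 and 1/2, so the mean is 0.5. Real evaluations usually flatten whole rendered label maps into such sequences before averaging.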

container_end_page	12936
container_issue	12
container_start_page	12923
container_title	IEEE transactions on circuits and systems for video technology
container_volume	34
creator	Liao, Guibiao ; Zhou, Kaichen ; Bao, Zhenyu ; Liu, Kanglin ; Li, Qing
description	The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments show that OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistently superior results across various CLIP configurations, further verifying its robustness. Code is available at: https://github.com/pcl3dv/OV-NeRF
doi_str_mv	10.1109/TCSVT.2024.3439737
format	Article
fullrecord	<record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_3147528545</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10630553</ieee_id><sourcerecordid>3147528545</sourcerecordid><originalsourceid>FETCH-LOGICAL-c221t-b7646786e6b411fd28d4c3e034dbfd0822cd62919efb09d692fbe4aabbe8fe4b3</originalsourceid><addsrcrecordid>eNpNkE1Lw0AQhoMoWKt_QDwseE7dz3x4k2pVqC30Ix7Dbna2bombupuA_ntT24OnGWbeZwaeKLomeEQIzu9W42WxGlFM-YhxlqcsPYkGRIgsphSL077HgsQZJeI8ughhizHhGU8H0fe8iGewmNyj-Q5cXDSVVF0t_Q-aQedljRZSW-kqQBMLtQ7o3bYfqLDBNg5Jp9FUuk0nN_2-6ZyW7X7-1mioAzKNR-wRLeFTutZWaO00-ND2lHWby-jMyDrA1bEOo_XkaTV-iafz59fxwzSuKCVtrNKEJ2mWQKI4IUbTTPOKAWZcK6NxRmmlE5qTHIzCuU5yahRwKZWCzABXbBjdHu7ufPPVQWjLbdN5178sGeGpoJngok_RQ6ryTQgeTLnz9rPXUBJc7g2Xf4bLveHyaLiHbg6QBYB_QMKwEIz9AvJWeI8</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>3147528545</pqid></control><display><type>article</type><title>OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding</title><source>IEEE Electronic Library (IEL)</source><creator>Liao, Guibiao ; Zhou, Kaichen ; Bao, Zhenyu ; Liu, Kanglin ; Li, Qing</creator><creatorcontrib>Liao, Guibiao ; Zhou, Kaichen ; Bao, Zhenyu ; Liu, Kanglin ; Li, Qing</creatorcontrib><description>The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. 
To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate our OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU metric on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistent superior results across various CLIP configurations, further verifying its robustness. 
Codes are available at: https://github.com/pcl3dv/OV-NeRF .</description><identifier>ISSN: 1051-8215</identifier><identifier>EISSN: 1558-2205</identifier><identifier>DOI: 10.1109/TCSVT.2024.3439737</identifier><identifier>CODEN: ITCTEM</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Circuits and systems ; cross-view self-enhancement ; Learning ; Neural radiance field ; open-vocabulary ; Radiance ; Regularization ; Rendering (computer graphics) ; Semantics ; Solid modeling ; Three-dimensional displays ; Training ; vision and language foundation models</subject><ispartof>IEEE transactions on circuits and systems for video technology, 2024-12, Vol.34 (12), p.12923-12936</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><cites>FETCH-LOGICAL-c221t-b7646786e6b411fd28d4c3e034dbfd0822cd62919efb09d692fbe4aabbe8fe4b3</cites><orcidid>0000-0003-1368-9364 ; 0000-0002-5714-1926 ; 0000-0002-6293-5464 ; 0000-0002-7816-9733</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10630553$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,776,780,792,27903,27904,54736</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/10630553$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Liao, Guibiao</creatorcontrib><creatorcontrib>Zhou, Kaichen</creatorcontrib><creatorcontrib>Bao, Zhenyu</creatorcontrib><creatorcontrib>Liu, Kanglin</creatorcontrib><creatorcontrib>Li, Qing</creatorcontrib><title>OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding</title><title>IEEE transactions on circuits and systems for 
video technology</title><addtitle>TCSVT</addtitle><description>The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate our OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU metric on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistent superior results across various CLIP configurations, further verifying its robustness. 
Codes are available at: https://github.com/pcl3dv/OV-NeRF .</description><subject>Circuits and systems</subject><subject>cross-view self-enhancement</subject><subject>Learning</subject><subject>Neural radiance field</subject><subject>open-vocabulary</subject><subject>Radiance</subject><subject>Regularization</subject><subject>Rendering (computer graphics)</subject><subject>Semantics</subject><subject>Solid modeling</subject><subject>Three-dimensional displays</subject><subject>Training</subject><subject>vision and language foundation models</subject><issn>1051-8215</issn><issn>1558-2205</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2024</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNpNkE1Lw0AQhoMoWKt_QDwseE7dz3x4k2pVqC30Ix7Dbna2bombupuA_ntT24OnGWbeZwaeKLomeEQIzu9W42WxGlFM-YhxlqcsPYkGRIgsphSL077HgsQZJeI8ughhizHhGU8H0fe8iGewmNyj-Q5cXDSVVF0t_Q-aQedljRZSW-kqQBMLtQ7o3bYfqLDBNg5Jp9FUuk0nN_2-6ZyW7X7-1mioAzKNR-wRLeFTutZWaO00-ND2lHWby-jMyDrA1bEOo_XkaTV-iafz59fxwzSuKCVtrNKEJ2mWQKI4IUbTTPOKAWZcK6NxRmmlE5qTHIzCuU5yahRwKZWCzABXbBjdHu7ufPPVQWjLbdN5178sGeGpoJngok_RQ6ryTQgeTLnz9rPXUBJc7g2Xf4bLveHyaLiHbg6QBYB_QMKwEIz9AvJWeI8</recordid><startdate>20241201</startdate><enddate>20241201</enddate><creator>Liao, Guibiao</creator><creator>Zhou, Kaichen</creator><creator>Bao, Zhenyu</creator><creator>Liu, Kanglin</creator><creator>Li, Qing</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>8FD</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><orcidid>https://orcid.org/0000-0003-1368-9364</orcidid><orcidid>https://orcid.org/0000-0002-5714-1926</orcidid><orcidid>https://orcid.org/0000-0002-6293-5464</orcidid><orcidid>https://orcid.org/0000-0002-7816-9733</orcidid></search><sort><creationdate>20241201</creationdate><title>OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding</title><author>Liao, Guibiao ; Zhou, Kaichen ; Bao, Zhenyu ; Liu, Kanglin ; Li, Qing</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c221t-b7646786e6b411fd28d4c3e034dbfd0822cd62919efb09d692fbe4aabbe8fe4b3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2024</creationdate><topic>Circuits and systems</topic><topic>cross-view self-enhancement</topic><topic>Learning</topic><topic>Neural radiance field</topic><topic>open-vocabulary</topic><topic>Radiance</topic><topic>Regularization</topic><topic>Rendering (computer graphics)</topic><topic>Semantics</topic><topic>Solid modeling</topic><topic>Three-dimensional displays</topic><topic>Training</topic><topic>vision and language foundation models</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Liao, Guibiao</creatorcontrib><creatorcontrib>Zhou, Kaichen</creatorcontrib><creatorcontrib>Bao, Zhenyu</creatorcontrib><creatorcontrib>Liu, Kanglin</creatorcontrib><creatorcontrib>Li, Qing</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer 
and Information Systems Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Technology Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><jtitle>IEEE transactions on circuits and systems for video technology</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Liao, Guibiao</au><au>Zhou, Kaichen</au><au>Bao, Zhenyu</au><au>Liu, Kanglin</au><au>Li, Qing</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding</atitle><jtitle>IEEE transactions on circuits and systems for video technology</jtitle><stitle>TCSVT</stitle><date>2024-12-01</date><risdate>2024</risdate><volume>34</volume><issue>12</issue><spage>12923</spage><epage>12936</epage><pages>12923-12936</pages><issn>1051-8215</issn><eissn>1558-2205</eissn><coden>ITCTEM</coden><abstract>The development of Neural Radiance Fields (NeRFs) has provided a potent representation for encapsulating the geometric and appearance characteristics of 3D scenes. Enhancing the capabilities of NeRFs in open-vocabulary 3D semantic perception tasks has been a recent focus. However, current methods that extract semantics directly from Contrastive Language-Image Pretraining (CLIP) for semantic field learning encounter difficulties due to noisy and view-inconsistent semantics provided by CLIP. 
To tackle these limitations, we propose OV-NeRF, which exploits the potential of pre-trained vision and language foundation models to enhance semantic field learning through proposed single-view and cross-view strategies. First, from the single-view perspective, we introduce Region Semantic Ranking (RSR) regularization by leveraging 2D mask proposals derived from Segment Anything (SAM) to rectify the noisy semantics of each training view, facilitating accurate semantic field learning. Second, from the cross-view perspective, we propose a Cross-view Self-enhancement (CSE) strategy to address the challenge raised by view-inconsistent semantics. Rather than invariably utilizing the 2D inconsistent semantics from CLIP, CSE leverages the 3D consistent semantics generated from the well-trained semantic field itself for semantic field training, aiming to reduce ambiguity and enhance overall semantic consistency across different views. Extensive experiments validate our OV-NeRF outperforms current state-of-the-art methods, achieving a significant improvement of 20.31% and 18.42% in mIoU metric on Replica and ScanNet, respectively. Furthermore, our approach exhibits consistent superior results across various CLIP configurations, further verifying its robustness. Codes are available at: https://github.com/pcl3dv/OV-NeRF .</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/TCSVT.2024.3439737</doi><tpages>14</tpages><orcidid>https://orcid.org/0000-0003-1368-9364</orcidid><orcidid>https://orcid.org/0000-0002-5714-1926</orcidid><orcidid>https://orcid.org/0000-0002-6293-5464</orcidid><orcidid>https://orcid.org/0000-0002-7816-9733</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext_linktorsrc
identifier	ISSN: 1051-8215
ispartof	IEEE transactions on circuits and systems for video technology, 2024-12, Vol.34 (12), p.12923-12936
issn	1051-8215 1558-2205
language	eng
recordid	cdi_proquest_journals_3147528545
source	IEEE Electronic Library (IEL)
subjects	Circuits and systems ; cross-view self-enhancement ; Learning ; Neural radiance field ; open-vocabulary ; Radiance ; Regularization ; Rendering (computer graphics) ; Semantics ; Solid modeling ; Three-dimensional displays ; Training ; vision and language foundation models
title	OV-NeRF: Open-Vocabulary Neural Radiance Fields With Vision and Language Foundation Models for 3D Semantic Understanding
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-24T10%3A57%3A10IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=OV-NeRF:%20Open-Vocabulary%20Neural%20Radiance%20Fields%20With%20Vision%20and%20Language%20Foundation%20Models%20for%203D%20Semantic%20Understanding&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems%20for%20video%20technology&rft.au=Liao,%20Guibiao&rft.date=2024-12-01&rft.volume=34&rft.issue=12&rft.spage=12923&rft.epage=12936&rft.pages=12923-12936&rft.issn=1051-8215&rft.eissn=1558-2205&rft.coden=ITCTEM&rft_id=info:doi/10.1109/TCSVT.2024.3439737&rft_dat=%3Cproquest_RIE%3E3147528545%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3147528545&rft_id=info:pmid/&rft_ieee_id=10630553&rfr_iscdi=true
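
The `url` field above is an OpenURL link (`ctx_ver=Z39.88-2004`) that carries the citation as percent-encoded `rft.*` key/value pairs. The following is a minimal sketch of decoding such a link with Python's standard library. The URL in the code is a shortened copy that keeps only a few of the parameters shown in this record, and the variable names are chosen for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Shortened copy of the record's OpenURL; only some parameters are kept here.
url = (
    "https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004"
    "&rft.genre=article"
    "&rft.jtitle=IEEE%20transactions%20on%20circuits%20and%20systems"
    "%20for%20video%20technology"
    "&rft.volume=34&rft.issue=12&rft.spage=12923&rft.epage=12936"
    "&rft_id=info:doi/10.1109/TCSVT.2024.3439737"
)

# parse_qs percent-decodes values and returns a list per key,
# because keys such as rft_id may repeat.
params = parse_qs(urlsplit(url).query)

# Collect the referent ("rft.") fields into a flat citation dictionary.
citation = {
    key[len("rft."):]: values[0]
    for key, values in params.items()
    if key.startswith("rft.")
}

# The DOI travels as an identifier of the form "info:doi/<doi>".
doi = next(
    (v.split("info:doi/", 1)[1] for v in params.get("rft_id", [])
     if v.startswith("info:doi/")),
    None,
)

print(citation["jtitle"], citation["volume"], citation["issue"], doi)
```

Reading `rft_id` as a list matters for the full link in this record, which repeats that key for the DOI, OAI, and PMID identifiers.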