DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2% for zero-shot classification on OBJ_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
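The abstract describes a two-branch pipeline: project the point cloud to a depth map, use depth-conditioned stable diffusion (ControlNet) to render a realistic image in the visual branch, and score that image against class prompts with CLIP in the textual branch. The sketch below is a minimal illustration of that idea, not the authors' released implementation: the Hugging Face checkpoints (lllyasviel/sd-controlnet-depth, runwayml/stable-diffusion-v1-5, openai/clip-vit-base-patch32), the depth_map.png input file, and the hand-written prompts standing in for the paper's learned style-prompt generation module are all assumptions made for this example.

```python
# Illustrative sketch only: not the DiffCLIP authors' code. Assumes a depth map
# rendered from a point cloud has been saved as "depth_map.png" (hypothetical file).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Visual branch: a depth-conditioned ControlNet turns the projected depth map into
# a photorealistic image, narrowing the domain gap to CLIP's training images.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
).to(device)

depth_map = Image.open("depth_map.png").convert("RGB")  # projected from the point cloud
rendered = pipe(
    "a photo of an object", image=depth_map, num_inference_steps=30
).images[0]

# Textual branch: simple hand-written prompts stand in for the learned
# style-prompt generation module described in the abstract.
class_names = ["chair", "table", "sofa", "bed", "monitor"]
prompts = [f"a photo of a {c}" for c in class_names]

# Zero-shot classification with CLIP on the diffusion-rendered image.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(
    text=prompts, images=rendered, return_tensors="pt", padding=True
).to(device)
with torch.no_grad():
    probs = clip_model(**inputs).logits_per_image.softmax(dim=-1)

print("predicted class:", class_names[probs.argmax().item()])
```

In the paper's setting this scoring step would be repeated over multiple views of the point cloud and the per-view logits aggregated; the single-view version above only illustrates the depth-to-image-to-CLIP flow.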

Bibliographic Details
Published in: arXiv.org, 2024-05
Main authors: Shen, Sitian; Zhu, Zilin; Fan, Linqian; Zhang, Harry; Wu, Xinxiao
Format: Article
Language: English
Subjects: Accuracy; Classification; Computer vision; Domains; Image classification; Image segmentation; Object recognition; Semantic segmentation; Three dimensional models; Training
Online access: Full text
identifier EISSN: 2331-8422
recordid cdi_proquest_journals_2819552289
source Free E-Journals
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T22%3A29%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=DiffCLIP:%20Leveraging%20Stable%20Diffusion%20for%20Language%20Grounded%203D%20Classification&rft.jtitle=arXiv.org&rft.au=Shen,%20Sitian&rft.date=2024-05-06&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2819552289%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2819552289&rft_id=info:pmid/&rfr_iscdi=true