DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2% for zero-shot classification on OBJ_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
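The abstract describes a two-branch pipeline: project the point cloud to a depth map, use depth-conditioned stable diffusion (ControlNet) to render a realistic image in the visual branch, and score that image against class prompts with CLIP in the textual branch. The sketch below is a minimal illustration of that idea, not the authors' released implementation: the Hugging Face checkpoints (lllyasviel/sd-controlnet-depth, runwayml/stable-diffusion-v1-5, openai/clip-vit-base-patch32), the depth_map.png input file, and the hand-written prompts standing in for the paper's learned style-prompt generation module are all assumptions made for this example.

```python
# Illustrative sketch only: not the DiffCLIP authors' code. Assumes a depth map
# rendered from a point cloud has been saved as "depth_map.png" (hypothetical file).
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

# Visual branch: a depth-conditioned ControlNet turns the projected depth map into
# a photorealistic image, narrowing the domain gap to CLIP's training images.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
).to(device)

depth_map = Image.open("depth_map.png").convert("RGB")  # projected from the point cloud
rendered = pipe(
    "a photo of an object", image=depth_map, num_inference_steps=30
).images[0]

# Textual branch: simple hand-written prompts stand in for the learned
# style-prompt generation module described in the abstract.
class_names = ["chair", "table", "sofa", "bed", "monitor"]
prompts = [f"a photo of a {c}" for c in class_names]

# Zero-shot classification with CLIP on the diffusion-rendered image.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

inputs = processor(
    text=prompts, images=rendered, return_tensors="pt", padding=True
).to(device)
with torch.no_grad():
    probs = clip_model(**inputs).logits_per_image.softmax(dim=-1)

print("predicted class:", class_names[probs.argmax().item()])
```

In the paper's setting this scoring step would be repeated over multiple views of the point cloud and the per-view logits aggregated; the single-view version above only illustrates the depth-to-image-to-CLIP flow.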

Bibliographic Details
Published in: arXiv.org, 2024-05
Main authors: Shen, Sitian; Zhu, Zilin; Fan, Linqian; Zhang, Harry; Wu, Xinxiao
Format: Article
Language: English
Subjects: Accuracy; Classification; Computer vision; Domains; Image classification; Image segmentation; Object recognition; Semantic segmentation; Three dimensional models; Training
Online access: Full text
identifier EISSN: 2331-8422
recordid cdi_proquest_journals_2819552289
source Free E-Journals
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T22%3A29%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=DiffCLIP:%20Leveraging%20Stable%20Diffusion%20for%20Language%20Grounded%203D%20Classification&rft.jtitle=arXiv.org&rft.au=Shen,%20Sitian&rft.date=2024-05-06&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2819552289%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2819552289&rft_id=info:pmid/&rfr_iscdi=true