DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2% for zero-shot classification on OBJ_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance.
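The abstract outlines a zero-shot pipeline: project a point cloud to a depth map, use a depth-conditioned ControlNet with Stable Diffusion to narrow the depth-to-photo domain gap, then classify the styled rendering with CLIP against per-category text prompts. The sketch below is a minimal illustration of that idea, not the authors' released code; the Hugging Face model identifiers, the orthographic projection helper, and the category list are assumptions made for the example.

```python
# Minimal sketch of the zero-shot flow described in the abstract (not DiffCLIP itself).
# Assumptions: the model IDs below, the naive orthographic projection helper, and the
# small ModelNet10-style label list are illustrative stand-ins.
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from transformers import CLIPModel, CLIPProcessor


def depth_map_from_points(points: np.ndarray, size: int = 512) -> Image.Image:
    """Orthographically project an (N, 3) point cloud onto the xy-plane,
    using normalized z as depth; overlapping points simply overwrite."""
    xy = points[:, :2]
    xy = (xy - xy.min(axis=0)) / (np.ptp(xy, axis=0) + 1e-8)
    px = (xy * (size - 1)).astype(int)
    z = (points[:, 2] - points[:, 2].min()) / (np.ptp(points[:, 2]) + 1e-8)
    depth = np.zeros((size, size), dtype=np.float32)
    depth[px[:, 1], px[:, 0]] = z
    return Image.fromarray((depth * 255).astype(np.uint8)).convert("RGB")


# Depth-conditioned ControlNet + Stable Diffusion: turn the sparse depth map into a
# photo-like image, narrowing the gap to the natural images CLIP was trained on.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

categories = ["chair", "table", "sofa", "bed", "monitor"]  # illustrative labels
points = np.random.rand(2048, 3)  # stand-in for a real point cloud

depth = depth_map_from_points(points)
styled = pipe("a photo of an object", image=depth, num_inference_steps=20).images[0]

# Zero-shot classification: score the styled image against one text prompt per class.
inputs = processor(
    text=[f"a photo of a {c}" for c in categories],
    images=styled,
    return_tensors="pt",
    padding=True,
)
with torch.no_grad():
    probs = clip(**inputs).logits_per_image.softmax(dim=-1)
print(categories[int(probs.argmax())])
```

The paper's style-prompt generation module for the textual branch is not reproduced here; the fixed "a photo of a {c}" template is a plain CLIP-style placeholder.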
Saved in:
Published in: | arXiv.org, 2024-05 |
---|---|
Main authors: | Shen, Sitian; Zhu, Zilin; Fan, Linqian; Zhang, Harry; Wu, Xinxiao |
Format: | Article |
Language: | eng |
Subjects: | Accuracy; Classification; Computer vision; Domains; Image classification; Image segmentation; Object recognition; Semantic segmentation; Three dimensional models; Training |
Online access: | Full text |
container_title | arXiv.org |
---|---|
creator | Shen, Sitian; Zhu, Zilin; Fan, Linqian; Zhang, Harry; Wu, Xinxiao |
description | Large pre-trained models have had a significant impact on computer vision by enabling multi-modal learning, where the CLIP model has achieved impressive results in image classification, object detection, and semantic segmentation. However, the model's performance on 3D point cloud processing tasks is limited due to the domain gap between depth maps from 3D projection and training images of CLIP. This paper proposes DiffCLIP, a new pre-training framework that incorporates stable diffusion with ControlNet to minimize the domain gap in the visual branch. Additionally, a style-prompt generation module is introduced for few-shot tasks in the textual branch. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong abilities for 3D understanding. By using stable diffusion and style-prompt generation, DiffCLIP achieves an accuracy of 43.2% for zero-shot classification on OBJ_BG of ScanObjectNN, which is state-of-the-art performance, and an accuracy of 80.6% for zero-shot classification on ModelNet10, which is comparable to state-of-the-art performance. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-05 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_2819552289 |
source | Free E-Journals |
subjects | Accuracy; Classification; Computer vision; Domains; Image classification; Image segmentation; Object recognition; Semantic segmentation; Three dimensional models; Training |
title | DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-03T22%3A29%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=DiffCLIP:%20Leveraging%20Stable%20Diffusion%20for%20Language%20Grounded%203D%20Classification&rft.jtitle=arXiv.org&rft.au=Shen,%20Sitian&rft.date=2024-05-06&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E2819552289%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2819552289&rft_id=info:pmid/&rfr_iscdi=true |