Direct3D: Scalable Image-to-3D Generation via 3D Latent Diffusion Transformer

Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model scalable to in-the-wild input images, without requiring a multiview diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods relying on rendered images as supervision signals. D3D-DiT models the distribution of encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model scalable to large-scale 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image input. Extensive experiments demonstrate the superiority of our large-scale pre-trained Direct3D over previous image-to-3D approaches, achieving significantly better generation quality and generalization ability, thus establishing a new state-of-the-art for 3D content creation. Project page: https://nju-3dv.github.io/projects/Direct3D/.
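The abstract describes a two-stage pipeline: D3D-VAE compresses a 3D shape into three axis-aligned latent feature planes (a triplane), and D3D-DiT then models the distribution of those latents. As a rough illustration of the triplane-latent idea only, the PyTorch sketch below encodes a point cloud into three 2D planes and decodes occupancy for arbitrary query points via bilinear plane lookups. The toy pooling encoder, all dimensions, and the grid_sample decoding scheme are hypothetical stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

RES, C_LAT = 32, 8  # hypothetical triplane resolution and channel count

class ToyTriplaneVAE(nn.Module):
    """Illustrative stand-in for the D3D-VAE idea: a 3D shape (here a point
    cloud) is compressed into three axis-aligned 2D feature planes, and
    geometry is decoded by querying those planes at arbitrary 3D points."""
    def __init__(self):
        super().__init__()
        # Toy encoder: global-pool the points, then project to three planes.
        self.encoder = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 3 * C_LAT * RES * RES))
        # Decoder maps concatenated per-plane features to an occupancy logit.
        self.decoder = nn.Sequential(
            nn.Linear(3 * C_LAT, 64), nn.ReLU(), nn.Linear(64, 1))

    def encode(self, points):                      # points: (B, N, 3) in [-1, 1]
        planes = self.encoder(points.mean(dim=1))  # crude global pooling
        return planes.view(-1, 3, C_LAT, RES, RES)  # XY, XZ, YZ planes

    def decode(self, planes, query):               # query: (B, M, 3) in [-1, 1]
        feats = []
        for i, (a, b) in enumerate([(0, 1), (0, 2), (1, 2)]):
            grid = query[..., [a, b]].unsqueeze(1)          # (B, 1, M, 2)
            f = F.grid_sample(planes[:, i], grid,
                              align_corners=False)          # (B, C_LAT, 1, M)
            feats.append(f.squeeze(2).transpose(1, 2))      # (B, M, C_LAT)
        return self.decoder(torch.cat(feats, dim=-1))       # (B, M, 1) logits

vae = ToyTriplaneVAE()
pts = torch.rand(2, 1024, 3) * 2 - 1                 # fake "shape" samples
planes = vae.encode(pts)                             # (2, 3, 8, 32, 32) latent
occ = vae.decode(planes, torch.rand(2, 256, 3) * 2 - 1)
print(planes.shape, occ.shape)
```

The appeal of the triplane representation is that a volumetric field is stored as three 2D feature maps, which keeps the latent compact and image-like enough for a diffusion transformer to model at scale.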

Bibliographic Details
Published in: arXiv.org, 2024-06
Main Authors: Wu, Shuang; Lin, Youtian; Zhang, Feihu; Zeng, Yifei; Xu, Jingxi; Torr, Philip; Cao, Xun; Yao, Yao
Format: Article
Language: English
Subjects: Feature maps; Image quality; Transformers
Online Access: Full text
EISSN: 2331-8422
Source: Free E-Journals