Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model
In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts and may entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the need to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D generation from the same text prompt.
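For a concrete picture of the hybrid optimization mentioned in the abstract, the snippet below is a minimal, self-contained toy sketch rather than the paper's implementation: the "renderer" (a learnable image per view), the zero-predicting denoiser, the noising schedule, and the weight `lambda_rgb` are all illustrative assumptions. It only shows the structure of combining an SDS-style gradient from a frozen denoiser with an RGB reconstruction loss against sparse reference views.

```python
# Toy sketch (illustrative assumptions, not the paper's code): SDS-style guidance
# from a frozen denoiser mixed with an RGB loss against sparse reference views.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_views, H, W = 4, 16, 16

# Stand-in for the differentiable 3D representation: one learnable image per view.
# A real pipeline would render a NeRF (or similar) from 4 camera poses instead.
params = torch.rand(n_views, 3, H, W, requires_grad=True)

# Stand-in for the frozen multi-view diffusion denoiser (a pretrained model such
# as MVDream in the real setting); this toy simply predicts zero noise.
def denoiser(x_noisy: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    return torch.zeros_like(x_noisy)

# Sparse RGB reference views; in the paper these come from the text-guided
# 4-view generation stage, here they are just random targets.
ref_views = torch.rand(n_views, 3, H, W)

optimizer = torch.optim.Adam([params], lr=1e-2)
lambda_rgb = 1.0  # weight balancing the two terms (illustrative value)

for step in range(200):
    imgs = torch.sigmoid(params)  # "render" the current state into 4 views in [0, 1]

    # SDS-style term: gradient ~ w(t) * (eps_hat - eps) * d(imgs)/d(params),
    # with no backpropagation through the frozen denoiser.
    t = torch.rand(())                    # toy diffusion time in (0, 1)
    eps = torch.randn_like(imgs)
    x_noisy = (1.0 - t) * imgs + t * eps  # toy forward-noising schedule
    with torch.no_grad():
        eps_hat = denoiser(x_noisy, t)
    # Surrogate scalar whose gradient w.r.t. imgs equals (eps_hat - eps);
    # the weighting w(t) is folded into the learning rate for simplicity.
    loss_sds = ((eps_hat - eps).detach() * imgs).sum()

    # Sparse RGB reference term (plain MSE against the 4 reference views).
    loss_rgb = F.mse_loss(imgs, ref_views)

    loss = loss_sds + lambda_rgb * loss_rgb
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```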
Saved in:
Main Authors: | Li, Xiaolong; Mo, Jiawei; Wang, Ying; Parameshwara, Chethan; Fei, Xiaohan; Swaminathan, Ashwin; Taylor, CJ; Tu, Zhuowen; Favaro, Paolo; Soatto, Stefano |
---|---|
Format: | Article |
Language: | English |
Subjects: | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition |
Online Access: | Request full text |
creator | Li, Xiaolong; Mo, Jiawei; Wang, Ying; Parameshwara, Chethan; Fei, Xiaohan; Swaminathan, Ashwin; Taylor, CJ; Tu, Zhuowen; Favaro, Paolo; Soatto, Stefano |
description | In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts and may entirely omit certain subjects or parts. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the need to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D generation from the same text prompt. |
doi_str_mv | 10.48550/arxiv.2404.18065 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2404.18065 |
language | eng |
recordid | cdi_arxiv_primary_2404_18065 |
source | arXiv.org |
subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition |
title | Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-25T00%3A39%3A59IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Grounded%20Compositional%20and%20Diverse%20Text-to-3D%20with%20Pretrained%20Multi-View%20Diffusion%20Model&rft.au=Li,%20Xiaolong&rft.date=2024-04-28&rft_id=info:doi/10.48550/arxiv.2404.18065&rft_dat=%3Carxiv_GOX%3E2404_18065%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |