Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model

In this paper, we propose an effective two-stage approach named Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts while achieving high fidelity by using a pre-trained multi-view diffusion model. Multi-view diffusion models, such as MVDream, have been shown to generate high-fidelity 3D assets using score distillation sampling (SDS). However, applied naively, these methods often fail to comprehend compositional text prompts and may omit certain subjects or parts entirely. To address this issue, we first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline. We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation, without the need to re-train the multi-view diffusion model or craft a high-quality compositional 3D dataset. We further propose a hybrid optimization strategy to encourage synergy between the SDS loss and the sparse RGB reference images. Our method consistently outperforms previous state-of-the-art (SOTA) methods in generating compositional 3D assets, excelling in both quality and accuracy, and enabling diverse 3D generation from the same text prompt.
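
To make the hybrid optimization idea concrete, the sketch below shows how an SDS-style gradient from a frozen multi-view diffusion model could be combined with an MSE loss against sparse 4-view RGB reference images while optimizing a 3D representation. This is a minimal illustration under stated assumptions, not the paper's implementation: `TinyNeRF`, `sds_grad`, the tensor shapes, and the loss weighting are all placeholders.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: the real pipeline would use a NeRF renderer and a
# frozen multi-view diffusion model (e.g. MVDream) to produce the SDS gradient.

class TinyNeRF(torch.nn.Module):
    """Placeholder 3D representation that 'renders' one RGB image per camera."""
    def __init__(self):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(6, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3)
        )

    def render(self, cameras):
        # cameras: (B, 6) placeholder pose parameters -> (B, 3, 64, 64) images
        b = cameras.shape[0]
        rgb = torch.sigmoid(self.mlp(cameras))            # (B, 3)
        return rgb.view(b, 3, 1, 1).expand(b, 3, 64, 64)

def sds_grad(rendered, text_embedding):
    # Stand-in for the score-distillation gradient from the frozen diffusion
    # model; here it is just noise so the example runs without model weights.
    return torch.randn_like(rendered)

nerf = TinyNeRF()
opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)

cameras = torch.randn(4, 6)              # 4 canonical views (assumed parametrization)
ref_images = torch.rand(4, 3, 64, 64)    # stage-1 text-aligned 4-view images (placeholder)
text_emb = torch.randn(1, 77, 768)       # frozen text-encoder output (assumed shape)

for step in range(100):
    rendered = nerf.render(cameras)

    # SDS term: inject the detached diffusion gradient through a surrogate loss,
    # so that d(loss_sds)/d(rendered) equals the injected gradient.
    grad = sds_grad(rendered, text_emb).detach()
    loss_sds = (rendered * grad).sum()

    # Sparse RGB reference term: anchor the 4 views to the stage-1 images.
    loss_rgb = F.mse_loss(rendered, ref_images)

    # Hybrid objective; the 10.0 weight is a guess, not taken from the paper.
    loss = loss_sds + 10.0 * loss_rgb
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the actual method the SDS gradient would come from the multi-view diffusion model's noise prediction on the rendered 4 views, and the reference term would anchor only those viewpoints; the sketch merely illustrates how the two terms might be balanced in a single optimization loop.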

Bibliographic Details
Main Authors: Li, Xiaolong; Mo, Jiawei; Wang, Ying; Parameshwara, Chethan; Fei, Xiaohan; Swaminathan, Ashwin; Taylor, CJ; Tu, Zhuowen; Favaro, Paolo; Soatto, Stefano
Format: Article
Language: English
Keywords: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access: Order full text
DOI: 10.48550/arxiv.2404.18065
Source: arXiv.org