Consistent123: Improve Consistency for One Image to 3D Object Synthesis

Bibliographic Details
Main authors: Weng, Haohan; Yang, Tianyu; Wang, Jianan; Li, Yu; Zhang, Tong; Chen, C. L. Philip; Zhang, Lei
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Weng, Haohan; Yang, Tianyu; Wang, Jianan; Li, Yu; Zhang, Tong; Chen, C. L. Philip; Zhang, Lei
description Large image diffusion models enable novel view synthesis with high quality and excellent zero-shot capability. However, such models based on image-to-image translation have no guarantee of view consistency, limiting the performance for downstream tasks like 3D reconstruction and image-to-3D generation. To empower consistency, we propose Consistent123 to synthesize novel views simultaneously by incorporating additional cross-view attention layers and the shared self-attention mechanism. The proposed attention mechanism improves the interaction across all synthesized views, as well as the alignment between the condition view and novel views. In the sampling stage, such architecture supports simultaneously generating an arbitrary number of views while training at a fixed length. We also introduce a progressive classifier-free guidance strategy to achieve the trade-off between texture and geometry for synthesized object views. Qualitative and quantitative experiments show that Consistent123 outperforms baselines in view consistency by a large margin. Furthermore, we demonstrate a significant improvement of Consistent123 on varying downstream tasks, showing its great potential in the 3D generation field. The project page is available at consistent-123.github.io.
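
The description above names two mechanisms that lend themselves to short sketches. First, a minimal PyTorch sketch of self-attention shared across views, in which the tokens of all views (condition and novel) are flattened into one sequence so that every view attends to every other; the module, tensor shapes, and names here are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class SharedViewAttention(nn.Module):
    """Illustrative shared self-attention over a set of views."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) -- latent tokens per view,
        # with the condition view included as one of the views.
        b, v, t, d = x.shape
        seq = x.reshape(b, v * t, d)        # flatten views into one sequence
        out, _ = self.attn(seq, seq, seq)   # keys/values shared by all views
        return out.reshape(b, v, t, d)

Because the views are flattened into a single attention sequence, the layer itself is agnostic to the view count, which matches the stated ability to sample an arbitrary number of views after training at a fixed length. Second, the progressive classifier-free guidance strategy can be sketched as a guidance weight that changes over the denoising steps; the linear schedule below is an assumed placeholder, since the paper defines its own schedule:

def progressive_cfg(eps_uncond, eps_cond, step, total_steps,
                    w_start=3.0, w_end=1.0):
    # Anneal the guidance weight linearly across sampling steps
    # (hypothetical schedule) to trade texture against geometry.
    w = w_start + (w_end - w_start) * step / max(total_steps - 1, 1)
    # Standard classifier-free guidance combination of the two
    # noise predictions (unconditional and view-conditioned).
    return eps_uncond + w * (eps_cond - eps_uncond)
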
doi_str_mv 10.48550/arxiv.2310.08092
format Article
identifier DOI: 10.48550/arxiv.2310.08092
language eng
recordid cdi_arxiv_primary_2310_08092
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Consistent123: Improve Consistency for One Image to 3D Object Synthesis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T11%3A12%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Consistent123:%20Improve%20Consistency%20for%20One%20Image%20to%203D%20Object%20Synthesis&rft.au=Weng,%20Haohan&rft.date=2023-10-12&rft_id=info:doi/10.48550/arxiv.2310.08092&rft_dat=%3Carxiv_GOX%3E2310_08092%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true