Consistent123: Improve Consistency for One Image to 3D Object Synthesis

Bibliographic Details
Main authors: Weng, Haohan; Yang, Tianyu; Wang, Jianan; Li, Yu; Zhang, Tong; Chen, C. L. Philip; Zhang, Lei
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online access: Order full text
creator Weng, Haohan; Yang, Tianyu; Wang, Jianan; Li, Yu; Zhang, Tong; Chen, C. L. Philip; Zhang, Lei
description Large image diffusion models enable novel view synthesis with high quality and excellent zero-shot capability. However, such models based on image-to-image translation have no guarantee of view consistency, limiting the performance for downstream tasks like 3D reconstruction and image-to-3D generation. To empower consistency, we propose Consistent123 to synthesize novel views simultaneously by incorporating additional cross-view attention layers and the shared self-attention mechanism. The proposed attention mechanism improves the interaction across all synthesized views, as well as the alignment between the condition view and novel views. In the sampling stage, such architecture supports simultaneously generating an arbitrary number of views while training at a fixed length. We also introduce a progressive classifier-free guidance strategy to achieve the trade-off between texture and geometry for synthesized object views. Qualitative and quantitative experiments show that Consistent123 outperforms baselines in view consistency by a large margin. Furthermore, we demonstrate a significant improvement of Consistent123 on varying downstream tasks, showing its great potential in the 3D generation field. The project page is available at consistent-123.github.io.
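
The description above names two mechanisms that lend themselves to short sketches. First, a minimal PyTorch sketch of self-attention shared across views, in which the tokens of all views (condition and novel) are flattened into one sequence so that every view attends to every other; the module, tensor shapes, and names here are illustrative assumptions, not the authors' implementation:

import torch
import torch.nn as nn

class SharedViewAttention(nn.Module):
    """Illustrative shared self-attention over a set of views."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, dim) -- latent tokens per view,
        # with the condition view included as one of the views.
        b, v, t, d = x.shape
        seq = x.reshape(b, v * t, d)        # flatten views into one sequence
        out, _ = self.attn(seq, seq, seq)   # keys/values shared by all views
        return out.reshape(b, v, t, d)

Because the views are flattened into a single attention sequence, the layer itself is agnostic to the view count, which matches the stated ability to sample an arbitrary number of views after training at a fixed length. Second, the progressive classifier-free guidance strategy can be sketched as a guidance weight that changes over the denoising steps; the linear schedule below is an assumed placeholder, since the paper defines its own schedule:

def progressive_cfg(eps_uncond, eps_cond, step, total_steps,
                    w_start=3.0, w_end=1.0):
    # Anneal the guidance weight linearly across sampling steps
    # (hypothetical schedule) to trade texture against geometry.
    w = w_start + (w_end - w_start) * step / max(total_steps - 1, 1)
    # Standard classifier-free guidance combination of the two
    # noise predictions (unconditional and view-conditioned).
    return eps_uncond + w * (eps_cond - eps_uncond)
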
doi_str_mv 10.48550/arxiv.2310.08092
format Article
identifier DOI: 10.48550/arxiv.2310.08092
language eng
recordid cdi_arxiv_primary_2310_08092
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Consistent123: Improve Consistency for One Image to 3D Object Synthesis
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-25T11%3A12%3A56IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Consistent123:%20Improve%20Consistency%20for%20One%20Image%20to%203D%20Object%20Synthesis&rft.au=Weng,%20Haohan&rft.date=2023-10-12&rft_id=info:doi/10.48550/arxiv.2310.08092&rft_dat=%3Carxiv_GOX%3E2310_08092%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true