STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians

Recent progress in pre-trained diffusion models and 3D generation has spurred interest in 4D content creation. However, achieving high-fidelity 4D generation with spatial-temporal consistency remains a challenge. In this work, we propose STAG4D, a novel framework that combines pre-trained diffusion models with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing inspiration from 3D generation techniques, we utilize a multi-view diffusion model to initialize multi-view images anchored on the input video frames, where the video can be either real-world captured or generated by a video diffusion model. To ensure the temporal consistency of the multi-view sequence initialization, we introduce a simple yet effective fusion strategy that leverages the first frame as a temporal anchor in the self-attention computation. With the almost consistent multi-view sequences, we then apply score distillation sampling to optimize the 4D Gaussian point cloud. The 4D Gaussian splatting is specially crafted for the generation task, where an adaptive densification strategy is proposed to mitigate unstable Gaussian gradients for robust optimization. Notably, the proposed pipeline does not require any pre-training or fine-tuning of diffusion networks, offering a more accessible and practical solution for the 4D generation task. Extensive experiments demonstrate that our method outperforms prior 4D generation works in rendering quality, spatial-temporal consistency, and generation robustness, setting a new state of the art for 4D generation from diverse inputs, including text, image, and video.
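The abstract describes using the first frame as a temporal anchor inside the self-attention computation of the multi-view diffusion model. Below is a minimal PyTorch sketch of that idea under stated assumptions, not the authors' implementation: each frame's tokens attend jointly to their own keys/values and to keys/values cached from the first (anchor) frame. The function name anchor_fused_attention, the head count, and the tensor shapes are illustrative.

import torch
import torch.nn.functional as F

def anchor_fused_attention(q, k, v, k_anchor, v_anchor, num_heads=8):
    # Self-attention for one frame's tokens, fused with the anchor frame:
    # the current frame attends to its own keys/values concatenated with
    # keys/values cached from the first (anchor) frame.
    # q, k, v:            (batch, tokens, dim) features of the current frame
    # k_anchor, v_anchor: (batch, tokens, dim) features of the anchor frame
    b, n, d = q.shape
    k_fused = torch.cat([k, k_anchor], dim=1)  # (b, 2n, d)
    v_fused = torch.cat([v, v_anchor], dim=1)  # (b, 2n, d)

    def split_heads(x):  # (b, tokens, d) -> (b, heads, tokens, d // heads)
        return x.view(b, -1, num_heads, d // num_heads).transpose(1, 2)

    out = F.scaled_dot_product_attention(
        split_heads(q), split_heads(k_fused), split_heads(v_fused))
    return out.transpose(1, 2).reshape(b, n, d)

# Toy usage: 64 tokens per view, 256-dim features.
q = k = v = torch.randn(2, 64, 256)
k0 = v0 = torch.randn(2, 64, 256)  # cached from the first (anchor) frame
print(anchor_fused_attention(q, k, v, k0, v0).shape)  # torch.Size([2, 64, 256])

In the actual pipeline this fusion would presumably sit inside the self-attention layers of the pre-trained multi-view diffusion network rather than being a standalone function; the sketch only illustrates how anchor keys/values can be fused without any additional training.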

Bibliographic Details
Main Authors: Zeng, Yifei; Jiang, Yanqin; Zhu, Siyu; Lu, Yuanxun; Lin, Youtian; Zhu, Hao; Hu, Weiming; Cao, Xun; Yao, Yao
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2403.14939
DOI: 10.48550/arxiv.2403.14939
Source: arXiv.org