Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models

We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame of the same video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame, which enables learning disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on synthetic 3D scene datasets as well as on two real-world video datasets (Objectron, Waymo Open).
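The abstract outlines the core mechanism: per-object appearance features are pooled from a reference frame, 3D poses come from the target frame, and the resulting per-object tokens stand in for the text tokens of a pre-trained text-to-image diffusion model. The sketch below shows one plausible way such tokens could be assembled. It is not the authors' implementation; the module names, dimensions, the RoIAlign pooling, the pose MLP, and the concatenate-and-project fusion are all assumptions made for illustration.

```python
# Hypothetical sketch of Neural Asset token assembly (not the authors' code).
# Assumptions: a stand-in visual encoder, RoIAlign pooling over reference-frame
# object boxes, an MLP over target-frame 3D poses, and concatenation + projection
# to form one conditioning token per object.
import torch
import torch.nn as nn
from torchvision.ops import roi_align


class NeuralAssetEncoder(nn.Module):
    def __init__(self, feat_dim=256, pose_dim=12, token_dim=768):
        super().__init__()
        # Stand-in visual backbone: a single patchify conv; a real system would
        # likely use a pre-trained image encoder here.
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=16, stride=16)
        self.pose_mlp = nn.Sequential(
            nn.Linear(pose_dim, token_dim), nn.ReLU(), nn.Linear(token_dim, token_dim)
        )
        self.vis_proj = nn.Linear(feat_dim, token_dim)
        self.fuse = nn.Linear(2 * token_dim, token_dim)

    def forward(self, ref_image, ref_boxes, target_poses):
        # ref_image:    (B, 3, H, W) reference frame
        # ref_boxes:    list of B tensors, each (N_i, 4), object boxes in the reference frame
        # target_poses: (sum N_i, pose_dim) 3D poses of the same objects in the target frame
        feats = self.backbone(ref_image)                          # (B, C, H/16, W/16)
        pooled = roi_align(feats, ref_boxes, output_size=1, spatial_scale=1 / 16)
        appearance = self.vis_proj(pooled.flatten(1))             # appearance from the *reference* frame
        pose = self.pose_mlp(target_poses)                        # pose from the *target* frame
        return self.fuse(torch.cat([appearance, pose], dim=-1))   # one token per object


encoder = NeuralAssetEncoder()
ref = torch.randn(1, 3, 256, 256)
boxes = [torch.tensor([[16.0, 16.0, 112.0, 112.0], [128.0, 64.0, 224.0, 192.0]])]
poses = torch.randn(2, 12)  # e.g., flattened 3x4 object-to-camera transforms (an assumption)
tokens = encoder(ref, boxes, poses)
print(tokens.shape)  # torch.Size([2, 768])
```

The resulting per-object tokens would then be passed to the diffusion model's cross-attention layers in place of the usual text-token embeddings during fine-tuning; batching across images and padding to a fixed number of objects are omitted from this sketch.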

Bibliographic Details
Main Authors: Wu, Ziyi; Rubanova, Yulia; Kabra, Rishabh; Hudson, Drew A; Gilitschenski, Igor; Aytar, Yusuf; van Steenkiste, Sjoerd; Allen, Kelsey R; Kipf, Thomas
Format: Article
Language: English
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition; Computer Science - Learning
creator Wu, Ziyi
Rubanova, Yulia
Kabra, Rishabh
Hudson, Drew A
Gilitschenski, Igor
Aytar, Yusuf
van Steenkiste, Sjoerd
Allen, Kelsey R
Kipf, Thomas
description We address the problem of multi-object 3D pose control in image diffusion models. Instead of conditioning on a sequence of text tokens, we propose to use a set of per-object representations, Neural Assets, to control the 3D pose of individual objects in a scene. Neural Assets are obtained by pooling visual representations of objects from a reference image, such as a frame in a video, and are trained to reconstruct the respective objects in a different image, e.g., a later frame in the video. Importantly, we encode object visuals from the reference image while conditioning on object poses from the target frame. This enables learning disentangled appearance and pose features. Combining visual and 3D pose representations in a sequence-of-tokens format allows us to keep the text-to-image architecture of existing models, with Neural Assets in place of text tokens. By fine-tuning a pre-trained text-to-image diffusion model with this information, our approach enables fine-grained 3D pose and placement control of individual objects in a scene. We further demonstrate that Neural Assets can be transferred and recomposed across different scenes. Our model achieves state-of-the-art multi-object editing results on both synthetic 3D scene datasets, as well as two real-world video datasets (Objectron, Waymo Open).
doi_str_mv 10.48550/arxiv.2406.09292
format Article
creationdate 2024-06-13
rights http://arxiv.org/licenses/nonexclusive-distrib/1.0
link https://arxiv.org/abs/2406.09292
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2406.09292
language eng
recordid cdi_arxiv_primary_2406_09292
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
Computer Science - Learning
title Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-18T15%3A09%3A41IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Neural%20Assets:%203D-Aware%20Multi-Object%20Scene%20Synthesis%20with%20Image%20Diffusion%20Models&rft.au=Wu,%20Ziyi&rft.date=2024-06-13&rft_id=info:doi/10.48550/arxiv.2406.09292&rft_dat=%3Carxiv_GOX%3E2406_09292%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true