DiVE: DiT-based Video Generation with Enhanced Control

Generating high-fidelity, temporally consistent videos in autonomous driving scenarios is a significant challenge, for example when handling problematic maneuvers in corner cases. Although recent video generation methods, including models built on top of Diffusion Transformers (DiT), have been proposed to tackle this problem, none of them explores the potential of DiT for multi-view video generation. We propose the first DiT-based framework specifically designed to generate temporally and multi-view consistent videos that precisely match a given bird's-eye view layout control. The framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency, and integrates joint cross-attention modules and a ControlNet-Transformer to further improve control precision. We extensively investigate qualitative comparisons on the nuScenes dataset, particularly in the most challenging corner cases, demonstrating that our method produces long, controllable, and highly consistent videos under difficult conditions.
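The "parameter-free spatial view-inflated attention" the abstract describes admits a simple reading: tokens from all camera views of one sample are merged into a single sequence before self-attention, so attention spans views without adding any weights. A minimal PyTorch sketch under that assumption (the function name, tensor layout, and num_views argument are illustrative, not taken from the paper):

    import torch
    import torch.nn.functional as F

    def view_inflated_attention(q, k, v, num_views):
        # q, k, v: (batch * num_views, heads, tokens, dim), as produced by a
        # per-view DiT self-attention block. Tokens from all views of one
        # sample are merged into a single sequence so attention is computed
        # jointly across views; no new weights are added ("parameter-free").
        bv, h, n, d = q.shape
        b = bv // num_views

        def inflate(x):
            # (b*v, h, n, d) -> (b, h, v*n, d): fold the view axis into the sequence.
            return x.view(b, num_views, h, n, d).permute(0, 2, 1, 3, 4).reshape(b, h, num_views * n, d)

        out = F.scaled_dot_product_attention(inflate(q), inflate(k), inflate(v))
        # Undo the merge so downstream per-view layers see the original shape.
        return out.view(b, h, num_views, n, d).permute(0, 2, 1, 3, 4).reshape(bv, h, n, d)

Because only the sequence axis is reshaped, such a change reuses the pretrained attention weights unchanged, which would explain why the mechanism can be called parameter-free.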

Full description

Saved in:
Bibliographic Details
Main Authors: Jiang, Junpeng, Hong, Gangyi, Zhou, Lijun, Ma, Enhui, Hu, Hengtong, Zhou, Xia, Xiang, Jie, Liu, Fan, Yu, Kaicheng, Sun, Haiyang, Zhan, Kun, Jia, Peng, Zhang, Miao
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Jiang, Junpeng
Hong, Gangyi
Zhou, Lijun
Ma, Enhui
Hu, Hengtong
Zhou, Xia
Xiang, Jie
Liu, Fan
Yu, Kaicheng
Sun, Haiyang
Zhan, Kun
Jia, Peng
Zhang, Miao
description Generating high-fidelity, temporally consistent videos in autonomous driving scenarios is a significant challenge, for example when handling problematic maneuvers in corner cases. Although recent video generation methods, including models built on top of Diffusion Transformers (DiT), have been proposed to tackle this problem, none of them explores the potential of DiT for multi-view video generation. We propose the first DiT-based framework specifically designed to generate temporally and multi-view consistent videos that precisely match a given bird's-eye view layout control. The framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency, and integrates joint cross-attention modules and a ControlNet-Transformer to further improve control precision. We extensively investigate qualitative comparisons on the nuScenes dataset, particularly in the most challenging corner cases, demonstrating that our method produces long, controllable, and highly consistent videos under difficult conditions.
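The ControlNet-Transformer integration mentioned in the description follows, in general form, the ControlNet recipe adapted to transformer blocks (as popularized by PixArt-delta): a trainable copy of a base block consumes the control signal and feeds its output back through a zero-initialized projection. A hedged sketch; the class and argument names are illustrative, and the paper's exact wiring may differ:

    import copy
    import torch.nn as nn

    class ControlNetTransformerBlock(nn.Module):
        # ControlNet-style control branch for one DiT block (sketch).
        # A trainable copy of the frozen base block processes hidden states
        # plus encoded control features (e.g. a BEV layout embedding), and
        # its output is injected back through a zero-initialized projection,
        # so training starts from the unmodified base model.
        def __init__(self, base_block: nn.Module, hidden_dim: int):
            super().__init__()
            self.control_block = copy.deepcopy(base_block)  # trainable copy
            self.zero_proj = nn.Linear(hidden_dim, hidden_dim)
            nn.init.zeros_(self.zero_proj.weight)
            nn.init.zeros_(self.zero_proj.bias)

        def forward(self, hidden_states, control_embed):
            # Condition the copied block on the control signal...
            residual = self.control_block(hidden_states + control_embed)
            # ...and add it back through the zero-initialized projection.
            return hidden_states + self.zero_proj(residual)

Zero-initializing the projection means the control branch contributes nothing at the start of training, so fine-tuning begins exactly at the frozen base model's behavior.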
doi_str_mv 10.48550/arxiv.2409.01595
format Article
identifier DOI: 10.48550/arxiv.2409.01595
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title DiVE: DiT-based Video Generation with Enhanced Control
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-08T15%3A22%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=DiVE:%20DiT-based%20Video%20Generation%20with%20Enhanced%20Control&rft.au=Jiang,%20Junpeng&rft.date=2024-09-03&rft_id=info:doi/10.48550/arxiv.2409.01595&rft_dat=%3Carxiv_GOX%3E2409_01595%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true