Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

Text-guided video-to-video stylization transforms the visual appearance of a source video into a different appearance guided by textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. However, they struggle to generate videos with both highly detailed...

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Xie, Minshan, Liu, Hanyuan, Li, Chengze, Wong, Tien-Tsin
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Xie, Minshan; Liu, Hanyuan; Li, Chengze; Wong, Tien-Tsin
description Text-guided video-to-video stylization transforms the visual appearance of a source video into a different appearance guided by textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. However, they struggle to generate videos with both a highly detailed appearance and temporal consistency. In this paper, we propose a synchronized multi-frame diffusion framework to maintain both the visual details and the temporal consistency. Frames are denoised in a synchronous fashion and, more importantly, information from different frames is shared from the beginning of the denoising process. Such information sharing ensures that a consensus on the overall structure and color distribution among frames can be reached in the early stage of the denoising process, before it is too late. The optical flow from the original video serves as the connection among frames, and hence the venue for information sharing. We demonstrate the effectiveness of our method in generating high-quality and diverse results in extensive experiments. Our method shows superior qualitative and quantitative results compared to state-of-the-art video editing methods.
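The description above outlines the method only at a high level: frames are denoised in lockstep and exchange information through optical-flow warping so they agree on structure and color early in the process. The following is a minimal, hypothetical Python/PyTorch sketch of that idea; `synchronized_denoise`, `denoise_step`, `warp`, and the blending weight are illustrative placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of synchronized multi-frame denoising with
# optical-flow-based information sharing. All names are placeholders.
import torch
import torch.nn.functional as F

def synchronized_denoise(latents, flows, denoise_step, timesteps, blend=0.5):
    """
    latents:      list of per-frame noisy latents, each of shape (C, H, W)
    flows:        flows[i] is a dense flow field (2, H, W) used to warp
                  information from frame i-1 onto frame i
    denoise_step: callable (latent, t) -> latent, one diffusion step per frame
    """
    for t in timesteps:
        # 1) Denoise every frame at the same timestep (synchronous update).
        latents = [denoise_step(z, t) for z in latents]
        # 2) Share information between neighbouring frames via optical flow,
        #    so a consensus on structure/colour forms early in denoising.
        shared = [latents[0]]
        for i in range(1, len(latents)):
            warped = warp(latents[i - 1], flows[i])
            shared.append(blend * latents[i] + (1.0 - blend) * warped)
        latents = shared
    return latents

def warp(latent, flow):
    # Placeholder backward warp of `latent` (C, H, W) with a flow field
    # (2, H, W) via grid_sample; occlusion handling is omitted here.
    C, H, W = latent.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs + flow[0], ys + flow[1]), dim=-1)
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1  # normalise x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1  # normalise y to [-1, 1]
    return F.grid_sample(
        latent.unsqueeze(0), grid.unsqueeze(0).float(), align_corners=True
    ).squeeze(0)
```

The sketch only illustrates the synchronization pattern: per-frame denoising followed by flow-guided blending at every timestep, rather than fixing inconsistencies after the fact.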
doi_str_mv 10.48550/arxiv.2311.14343
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2311.14343
language eng
recordid cdi_arxiv_primary_2311_14343
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T13%3A18%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Highly%20Detailed%20and%20Temporal%20Consistent%20Video%20Stylization%20via%20Synchronized%20Multi-Frame%20Diffusion&rft.au=Xie,%20Minshan&rft.date=2023-11-24&rft_id=info:doi/10.48550/arxiv.2311.14343&rft_dat=%3Carxiv_GOX%3E2311_14343%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true