Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion

Text-guided video-to-video stylization transforms the visual appearance of a source video into a different appearance guided by textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. However, they struggle to generate videos with both highly detailed...

Detailed Description

Saved in:
Bibliographic Details
Main Authors: Xie, Minshan, Liu, Hanyuan, Li, Chengze, Wong, Tien-Tsin
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Xie, Minshan; Liu, Hanyuan; Li, Chengze; Wong, Tien-Tsin
description Text-guided video-to-video stylization transforms the visual appearance of a source video into a different appearance guided by textual prompts. Existing text-guided image diffusion models can be extended for stylized video synthesis. However, they struggle to generate videos with both a highly detailed appearance and temporal consistency. In this paper, we propose a synchronized multi-frame diffusion framework to maintain both the visual details and the temporal consistency. Frames are denoised in a synchronous fashion and, more importantly, information from different frames is shared from the beginning of the denoising process. Such information sharing ensures that a consensus on the overall structure and color distribution among frames can be reached in the early stage of the denoising process, before it is too late. The optical flow from the original video serves as the connection among frames, and hence the venue for information sharing. We demonstrate the effectiveness of our method in generating high-quality and diverse results in extensive experiments. Our method shows superior qualitative and quantitative results compared to state-of-the-art video editing methods.
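The description above outlines the method only at a high level: frames are denoised in lockstep and exchange information through optical-flow warping so they agree on structure and color early in the process. The following is a minimal, hypothetical Python/PyTorch sketch of that idea; `synchronized_denoise`, `denoise_step`, `warp`, and the blending weight are illustrative placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of synchronized multi-frame denoising with
# optical-flow-based information sharing. All names are placeholders.
import torch
import torch.nn.functional as F

def synchronized_denoise(latents, flows, denoise_step, timesteps, blend=0.5):
    """
    latents:      list of per-frame noisy latents, each of shape (C, H, W)
    flows:        flows[i] is a dense flow field (2, H, W) used to warp
                  information from frame i-1 onto frame i
    denoise_step: callable (latent, t) -> latent, one diffusion step per frame
    """
    for t in timesteps:
        # 1) Denoise every frame at the same timestep (synchronous update).
        latents = [denoise_step(z, t) for z in latents]
        # 2) Share information between neighbouring frames via optical flow,
        #    so a consensus on structure/colour forms early in denoising.
        shared = [latents[0]]
        for i in range(1, len(latents)):
            warped = warp(latents[i - 1], flows[i])
            shared.append(blend * latents[i] + (1.0 - blend) * warped)
        latents = shared
    return latents

def warp(latent, flow):
    # Placeholder backward warp of `latent` (C, H, W) with a flow field
    # (2, H, W) via grid_sample; occlusion handling is omitted here.
    C, H, W = latent.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs + flow[0], ys + flow[1]), dim=-1)
    grid[..., 0] = 2 * grid[..., 0] / (W - 1) - 1  # normalise x to [-1, 1]
    grid[..., 1] = 2 * grid[..., 1] / (H - 1) - 1  # normalise y to [-1, 1]
    return F.grid_sample(
        latent.unsqueeze(0), grid.unsqueeze(0).float(), align_corners=True
    ).squeeze(0)
```

The sketch only illustrates the synchronization pattern: per-frame denoising followed by flow-guided blending at every timestep, rather than fixing inconsistencies after the fact.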
doi_str_mv 10.48550/arxiv.2311.14343
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2311.14343
language eng
recordid cdi_arxiv_primary_2311_14343
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title Highly Detailed and Temporal Consistent Video Stylization via Synchronized Multi-Frame Diffusion
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T13%3A18%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Highly%20Detailed%20and%20Temporal%20Consistent%20Video%20Stylization%20via%20Synchronized%20Multi-Frame%20Diffusion&rft.au=Xie,%20Minshan&rft.date=2023-11-24&rft_id=info:doi/10.48550/arxiv.2311.14343&rft_dat=%3Carxiv_GOX%3E2311_14343%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true