V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach involves leveraging generative models to enhance adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose...

Bibliographic details
Main authors: Wang, Cong; Tian, Kuan; Zhang, Jun; Guan, Yonghang; Luo, Feng; Shen, Fei; Jiang, Zhiwei; Gu, Qing; Han, Xiao; Yang, Wei
Format: Article
Language: eng
creator Wang, Cong
Tian, Kuan
Zhang, Jun
Guan, Yonghang
Luo, Feng
Shen, Fei
Jiang, Zhiwei
Gu, Qing
Han, Xiao
Yang, Wei
description In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach involves leveraging generative models to enhance adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map, etc.) can vary in strength. Among these, weaker conditions often struggle to be effective due to interference from stronger conditions, posing a challenge in balancing these conditions. In our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as facial pose and reference image. However, direct training with weak signals often leads to difficulties in convergence. To address this, we propose V-Express, a simple method that balances different control signals through the progressive training and the conditional dropout operation. Our method gradually enables effective control by weak conditions, thereby achieving generation capabilities that simultaneously take into account the facial pose, reference image, and audio. The experimental results demonstrate that our method can effectively generate portrait videos controlled by audio. Furthermore, a potential solution is provided for the simultaneous and effective use of conditions of varying strengths.
doi_str_mv 10.48550/arxiv.2406.02511
format Article
creationdate 2024-06-04
rights http://creativecommons.org/licenses/by/4.0
oa free_for_read
linktorsrc https://arxiv.org/abs/2406.02511
backlink https://doi.org/10.48550/arXiv.2406.02511
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2406.02511
language eng
recordid cdi_arxiv_primary_2406_02511
source arXiv.org
subjects Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition
title V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-02-09T13%3A06%3A05IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=V-Express:%20Conditional%20Dropout%20for%20Progressive%20Training%20of%20Portrait%20Video%20Generation&rft.au=Wang,%20Cong&rft.date=2024-06-04&rft_id=info:doi/10.48550/arxiv.2406.02511&rft_dat=%3Carxiv_GOX%3E2406_02511%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true