V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation

In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach involves leveraging generative models to enhance adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map, etc.) can vary in strength.

Full Description

Bibliographic Details
Published in: arXiv.org 2024-06
Main authors: Wang, Cong, Tian, Kuan, Zhang, Jun, Guan, Yonghang, Luo, Feng, Shen, Fei, Jiang, Zhiwei, Gu, Qing, Han, Xiao, Yang, Wei
Format: Article
Language: English
Subjects:
Online access: Full text
container_title arXiv.org
creator Wang, Cong
Tian, Kuan
Zhang, Jun
Guan, Yonghang
Luo, Feng
Shen, Fei
Jiang, Zhiwei
Gu, Qing
Han, Xiao
Yang, Wei
description In the field of portrait video generation, the use of single images to generate portrait videos has become increasingly prevalent. A common approach involves leveraging generative models to enhance adapters for controlled generation. However, control signals (e.g., text, audio, reference image, pose, depth map, etc.) can vary in strength. Among these, weaker conditions often struggle to be effective due to interference from stronger conditions, posing a challenge in balancing these conditions. In our work on portrait video generation, we identified audio signals as particularly weak, often overshadowed by stronger signals such as facial pose and reference image. However, direct training with weak signals often leads to difficulties in convergence. To address this, we propose V-Express, a simple method that balances different control signals through the progressive training and the conditional dropout operation. Our method gradually enables effective control by weak conditions, thereby achieving generation capabilities that simultaneously take into account the facial pose, reference image, and audio. The experimental results demonstrate that our method can effectively generate portrait videos controlled by audio. Furthermore, a potential solution is provided for the simultaneous and effective use of conditions of varying strengths.
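The balancing mechanism the abstract describes can be sketched in a few lines. The snippet below is an illustrative reading of the idea, not the authors' implementation: the condition names, the per-condition drop probabilities, and the linear warm-up schedule are all hypothetical assumptions, chosen only to show how stronger conditions (reference image, facial pose) can be randomly suppressed during training so the weaker audio condition retains influence.

```python
import random

# Hypothetical per-condition drop probabilities (illustrative values, not
# from the paper): stronger conditions are dropped more aggressively so
# the weak audio signal is not overshadowed during training.
DROP_PROB = {"reference_image": 0.5, "pose": 0.5, "audio": 0.1}

def conditional_dropout(conditions, step, total_steps, rng=random):
    """Zero out control signals at random during training.

    Dropout on the strong conditions ramps up over the first half of
    training (a progressive schedule; the ramp shape is an assumption),
    while the audio condition keeps its small base drop rate throughout.
    """
    ramp = min(1.0, step / max(1, total_steps // 2))  # linear warm-up
    kept = {}
    for name, signal in conditions.items():
        p = DROP_PROB.get(name, 0.0) * (ramp if name != "audio" else 1.0)
        kept[name] = None if rng.random() < p else signal
    return kept
```

In a diffusion-style training loop, the dict returned by `conditional_dropout` would replace the raw conditioning inputs, with `None` entries mapped to a learned null embedding; early in training all strong conditions pass through, and only later does the model see batches where it must rely on audio alone.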
format Article
fulltext fulltext
identifier EISSN: 2331-8422
ispartof arXiv.org, 2024-06
issn 2331-8422
language eng
recordid cdi_proquest_journals_3064739331
source Free E-Journals
subjects Audio signals
Signal generation
Video
title V-Express: Conditional Dropout for Progressive Training of Portrait Video Generation