Learning Frame-Wise Emotion Intensity for Audio-Driven Talking-Head Generation

Human emotional expression is inherently dynamic, complex, and fluid, characterized by smooth transitions in intensity throughout verbal communication. However, the modeling of such intensity fluctuations has been largely overlooked by previous audio-driven talking-head generation methods, which often results in static emotional outputs. In this paper, we explore how emotion intensity fluctuates during speech, proposing a method for capturing and generating these subtle shifts for talking-head generation. Specifically, we develop a talking-head framework that is capable of generating a variety of emotions with precise control over intensity levels. This is achieved by learning a continuous emotion latent space, where emotion types are encoded within latent orientations and emotion intensity is reflected in latent norms. In addition, to capture the dynamic intensity fluctuations, we adopt an audio-to-intensity predictor by considering the speaking tone that reflects the intensity. The training signals for this predictor are obtained through our emotion-agnostic intensity pseudo-labeling method without the need of frame-wise intensity labeling. Extensive experiments and analyses validate the effectiveness of our proposed method in accurately capturing and reproducing emotion intensity fluctuations in talking-head generation, thereby significantly enhancing the expressiveness and realism of the generated outputs.
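
The abstract's idea of encoding emotion type in the latent's orientation and emotion intensity in its norm can be pictured with a minimal sketch. Nothing below comes from the paper's code; the vector size, function names, and use of NumPy are illustrative assumptions.

# Minimal illustrative sketch (not the authors' implementation): emotion type
# lives in the latent's direction, intensity in its norm, as the abstract describes.
import numpy as np

def decompose(z: np.ndarray, eps: float = 1e-8):
    """Split an emotion latent into direction (emotion type) and norm (intensity)."""
    intensity = float(np.linalg.norm(z))
    direction = z / (intensity + eps)
    return direction, intensity

def with_intensity(z: np.ndarray, target: float, eps: float = 1e-8) -> np.ndarray:
    """Rescale the latent's norm to a target intensity while keeping its direction."""
    direction, _ = decompose(z, eps)
    return direction * target

# Example: the same emotion rendered mildly or strongly by changing only the norm.
z = np.random.randn(128)  # hypothetical 128-dimensional emotion latent
mild, strong = with_intensity(z, 0.3), with_intensity(z, 1.5)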

Bibliographic Details

Main authors: Xu, Jingyi; Le, Hieu; Shu, Zhixin; Wang, Yang; Tsai, Yi-Hsuan; Samaras, Dimitris
Format: Article
Language: English
Date: 2024-09-28
DOI: 10.48550/arxiv.2409.19501
Subjects: Computer Science - Artificial Intelligence; Computer Science - Sound
Source: arXiv.org
Online access: https://arxiv.org/abs/2409.19501