GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in motion diversity and lip-sync accuracy. To address this issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer. The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a latent quantilized spatial-temporal space. To achieve this, GLDiTalker builds upon a quantilized space-time diffusion training pipeline, which consists of a Graph Enhanced Quantilized Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first stage ensures lip-sync accuracy, while the second stage enhances motion diversity. Together, these stages enable GLDiTalker to generate temporally and spatially stable, realistic models. Extensive evaluations on several widely used benchmarks demonstrate that our method achieves superior performance compared to existing methods.


Bibliographic Details
Main Authors: Lin, Yihong; Fan, Zhaoxin; Xiong, Lingyu; Peng, Liang; Li, Xiandong; Kang, Wenxiong; Wu, Xianjia; Lei, Songju; Xu, Huang
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Lin, Yihong
Fan, Zhaoxin
Xiong, Lingyu
Peng, Liang
Li, Xiandong
Kang, Wenxiong
Wu, Xianjia
Lei, Songju
Xu, Huang
description Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in motion diversity and lip-sync accuracy. To address this issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer. The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a latent quantilized spatial-temporal space. To achieve this, GLDiTalker builds upon a quantilized space-time diffusion training pipeline, which consists of a Graph Enhanced Quantilized Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first stage ensures lip-sync accuracy, while the second stage enhances motion diversity. Together, these stages enable GLDiTalker to generate temporally and spatially stable, realistic models. Extensive evaluations on several widely used benchmarks demonstrate that our method achieves superior performance compared to existing methods.
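The two-stage pipeline summarized in the abstract is easiest to see as two decoupled training steps: first learn a quantized latent space for facial motion, then train a diffusion model in that latent space conditioned on audio. The following is a minimal, illustrative Python/PyTorch sketch of that structure only; it is not the authors' implementation, and every module name, dimension, and loss term here (QuantizedMotionAutoencoder, LatentDenoiser, the 15069-dimensional vertex vector, the toy noise schedule) is an assumption made for illustration. The graph enhancement and transformer backbone of GLDiTalker are omitted.

# Minimal sketch (not the authors' code) of a quantized-latent + latent-diffusion pipeline.
import torch
import torch.nn as nn

class QuantizedMotionAutoencoder(nn.Module):
    # Stage 1 (assumed form): encode per-frame mesh offsets into a discrete, codebook-quantized latent space.
    def __init__(self, vertex_dim=15069, latent_dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vertex_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, vertex_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z):
        # Nearest-codebook-entry lookup with a straight-through estimator.
        dists = torch.cdist(z, self.codebook.weight)      # (T, codebook_size)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx

    def forward(self, motion):                            # motion: (T, vertex_dim)
        z = self.encoder(motion)
        z_q, idx = self.quantize(z)
        return self.decoder(z_q), z, z_q

class LatentDenoiser(nn.Module):
    # Stage 2 (assumed form): predict the noise added to motion latents, conditioned on audio features.
    def __init__(self, latent_dim=64, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + audio_dim + 1, 256), nn.ReLU(), nn.Linear(256, latent_dim))

    def forward(self, noisy_z, audio, t):                 # noisy_z: (T, D), audio: (T, A), t: scalar step
        t_feat = torch.full((noisy_z.shape[0], 1), float(t))
        return self.net(torch.cat([noisy_z, audio, t_feat], dim=-1))

# Toy usage: one denoising training step in the (frozen) Stage-1 latent space.
T = 30
autoenc, denoiser = QuantizedMotionAutoencoder(), LatentDenoiser()
motion, audio = torch.randn(T, 15069), torch.randn(T, 128)
with torch.no_grad():
    _, _, z_q = autoenc(motion)                           # quantized motion latents
noise = torch.randn_like(z_q)
alpha = 0.7                                               # placeholder noise-schedule value
noisy_z = alpha ** 0.5 * z_q + (1 - alpha) ** 0.5 * noise
loss = ((denoiser(noisy_z, audio, t=10) - noise) ** 2).mean()
loss.backward()

In this simplified view, Stage 1 fixes which motions the codebook can express (supporting lip-sync accuracy), while the stochastic denoising of Stage 2 is what introduces motion diversity, matching the roles the abstract assigns to the two stages.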
doi_str_mv 10.48550/arxiv.2408.01826
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2408.01826
language eng
recordid cdi_arxiv_primary_2408_01826
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
url https://arxiv.org/abs/2408.01826