GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in motion diversity and lip-sync accuracy. To address this issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer. The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a latent quantilized spatial-temporal space. To achieve this, GLDiTalker builds upon a quantilized space-time diffusion training pipeline, which consists of a Graph Enhanced Quantilized Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first stage ensures lip-sync accuracy, while the second stage enhances motion diversity. Together, these stages enable GLDiTalker to generate temporally and spatially stable, realistic models. Extensive evaluations on several widely used benchmarks demonstrate that our method achieves superior performance compared to existing methods.


Bibliographic Details
Main Authors: Lin, Yihong; Fan, Zhaoxin; Xiong, Lingyu; Peng, Liang; Li, Xiandong; Kang, Wenxiong; Wu, Xianjia; Lei, Songju; Xu, Huang
Format: Article
Language: eng
Subjects:
Online Access: Order full text
creator Lin, Yihong
Fan, Zhaoxin
Xiong, Lingyu
Peng, Liang
Li, Xiandong
Kang, Wenxiong
Wu, Xianjia
Lei, Songju
Xu, Huang
description Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in motion diversity and lip-sync accuracy. To address this issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer. The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a latent quantilized spatial-temporal space. To achieve this, GLDiTalker builds upon a quantilized space-time diffusion training pipeline, which consists of a Graph Enhanced Quantilized Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first stage ensures lip-sync accuracy, while the second stage enhances motion diversity. Together, these stages enable GLDiTalker to generate temporally and spatially stable, realistic models. Extensive evaluations on several widely used benchmarks demonstrate that our method achieves superior performance compared to existing methods.
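The two-stage pipeline summarized in the abstract is easiest to see as two decoupled training steps: first learn a quantized latent space for facial motion, then train a diffusion model in that latent space conditioned on audio. The following is a minimal, illustrative Python/PyTorch sketch of that structure only; it is not the authors' implementation, and every module name, dimension, and loss term here (QuantizedMotionAutoencoder, LatentDenoiser, the 15069-dimensional vertex vector, the toy noise schedule) is an assumption made for illustration. The graph enhancement and transformer backbone of GLDiTalker are omitted.

# Minimal sketch (not the authors' code) of a quantized-latent + latent-diffusion pipeline.
import torch
import torch.nn as nn

class QuantizedMotionAutoencoder(nn.Module):
    # Stage 1 (assumed form): encode per-frame mesh offsets into a discrete, codebook-quantized latent space.
    def __init__(self, vertex_dim=15069, latent_dim=64, codebook_size=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vertex_dim, 512), nn.ReLU(), nn.Linear(512, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, vertex_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)

    def quantize(self, z):
        # Nearest-codebook-entry lookup with a straight-through estimator.
        dists = torch.cdist(z, self.codebook.weight)      # (T, codebook_size)
        idx = dists.argmin(dim=-1)
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx

    def forward(self, motion):                            # motion: (T, vertex_dim)
        z = self.encoder(motion)
        z_q, idx = self.quantize(z)
        return self.decoder(z_q), z, z_q

class LatentDenoiser(nn.Module):
    # Stage 2 (assumed form): predict the noise added to motion latents, conditioned on audio features.
    def __init__(self, latent_dim=64, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim + audio_dim + 1, 256), nn.ReLU(), nn.Linear(256, latent_dim))

    def forward(self, noisy_z, audio, t):                 # noisy_z: (T, D), audio: (T, A), t: scalar step
        t_feat = torch.full((noisy_z.shape[0], 1), float(t))
        return self.net(torch.cat([noisy_z, audio, t_feat], dim=-1))

# Toy usage: one denoising training step in the (frozen) Stage-1 latent space.
T = 30
autoenc, denoiser = QuantizedMotionAutoencoder(), LatentDenoiser()
motion, audio = torch.randn(T, 15069), torch.randn(T, 128)
with torch.no_grad():
    _, _, z_q = autoenc(motion)                           # quantized motion latents
noise = torch.randn_like(z_q)
alpha = 0.7                                               # placeholder noise-schedule value
noisy_z = alpha ** 0.5 * z_q + (1 - alpha) ** 0.5 * noise
loss = ((denoiser(noisy_z, audio, t=10) - noise) ** 2).mean()
loss.backward()

In this simplified view, Stage 1 fixes which motions the codebook can express (supporting lip-sync accuracy), while the stochastic denoising of Stage 2 is what introduces motion diversity, matching the roles the abstract assigns to the two stages.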
doi_str_mv 10.48550/arxiv.2408.01826
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2408.01826
language eng
recordid cdi_arxiv_primary_2408_01826
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
url https://arxiv.org/abs/2408.01826