GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencie...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Hauptverfasser: Lin, Yihong, Fan, Zhaoxin, Xiong, Lingyu, Peng, Liang, Li, Xiandong, Kang, Wenxiong, Wu, Xianjia, Lei, Songju, Xu, Huang
Format: Artikel
Sprache:eng
Schlagworte:
Online-Zugang:Volltext bestellen
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in motion diversity and lip-sync accuracy. To address this issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial animation model that employs a Graph Latent Diffusion Transformer. The core idea behind GLDiTalker is that the audio-mesh modality misalignment can be resolved by diffusing the signal in a latent quantilized spatial-temporal space. To achieve this, GLDiTalker builds upon a quantilized space-time diffusion training pipeline, which consists of a Graph Enhanced Quantilized Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first stage ensures lip-sync accuracy, while the second stage enhances motion diversity. Together, these stages enable GLDiTalker to generate temporally and spatially stable, realistic models. Extensive evaluations on several widely used benchmarks demonstrate that our method achieves superior performance compared to existing methods.
DOI:10.48550/arxiv.2408.01826