GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer
| Field | Value |
|---|---|
| Main authors | , , , , , , , , |
| Format | Article |
| Language | English |
| Subject headings | |
| Online access | Order full text |
Abstract: Speech-driven talking head generation is an important but challenging task
for many downstream applications such as augmented reality. Existing methods
have achieved remarkable performance by utilizing autoregressive models or
diffusion models. However, most still suffer from modality inconsistencies,
specifically the misalignment between audio and mesh modalities, which causes
inconsistencies in motion diversity and lip-sync accuracy. To address this
issue, this paper introduces GLDiTalker, a novel speech-driven 3D facial
animation model that employs a Graph Latent Diffusion Transformer. The core
idea behind GLDiTalker is that the audio-mesh modality misalignment can be
resolved by diffusing the signal in a latent quantilized spatial-temporal
space. To achieve this, GLDiTalker builds upon a quantilized space-time
diffusion training pipeline, which consists of a Graph Enhanced Quantilized
Space Learning Stage and a Space-Time Powered Latent Diffusion Stage. The first
stage ensures lip-sync accuracy, while the second stage enhances motion
diversity. Together, these stages enable GLDiTalker to generate temporally and
spatially stable, realistic models. Extensive evaluations on several widely
used benchmarks demonstrate that our method achieves superior performance
compared to existing methods.
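
The two-stage design described in the abstract follows a recognizable pattern: first learn a quantized latent space for mesh motion, then train a diffusion model over those latents conditioned on audio. The sketch below illustrates that pattern in PyTorch under assumptions of our own; the layer sizes, the plain MLP encoder standing in for the paper's graph-enhanced encoder, the 768-dimensional wav2vec-style audio features, the 5023-vertex mesh, and the noise-prediction loss are illustrative choices, not details taken from the GLDiTalker paper or its code.

```python
# Hypothetical sketch of a two-stage "quantized latent + diffusion" pipeline for
# speech-driven mesh animation. All sizes, module names, and the MLP encoder
# (standing in for a graph-enhanced mesh encoder) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Maps each encoder output to its nearest codebook entry (stage 1)."""
    def __init__(self, num_codes=256, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # z: (B, T, dim)
        flat = z.reshape(-1, z.size(-1))                   # (B*T, dim)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z)                # quantized latents
        z_q = z + (z_q - z).detach()                       # straight-through gradient
        return z_q, idx.view(z.shape[:2])


class MeshMotionVQAE(nn.Module):
    """Stage-1 autoencoder over per-frame vertex offsets. A faithful version
    would use graph convolutions over the mesh topology instead of MLPs."""
    def __init__(self, n_verts=5023, dim=128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_verts * 3, 512), nn.GELU(), nn.Linear(512, dim))
        self.quant = VectorQuantizer(dim=dim)
        self.dec = nn.Sequential(nn.Linear(dim, 512), nn.GELU(), nn.Linear(512, n_verts * 3))

    def forward(self, motion):                             # motion: (B, T, n_verts*3)
        z_q, _ = self.quant(self.enc(motion))
        return self.dec(z_q), z_q


class LatentDiffusionTransformer(nn.Module):
    """Stage 2: predicts the noise added to motion latents, conditioned on
    frame-aligned audio features and the diffusion timestep."""
    def __init__(self, dim=128, audio_dim=768, n_layers=4, max_steps=1000):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)
        self.t_embed = nn.Embedding(max_steps, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_latent, audio_feat, t):        # (B,T,dim), (B,T,audio_dim), (B,)
        h = noisy_latent + self.audio_proj(audio_feat) + self.t_embed(t)[:, None, :]
        return self.head(self.backbone(h))


def diffusion_step(vqae, denoiser, motion, audio_feat, alphas_cumprod):
    """One illustrative noise-prediction training step in the learned latent space."""
    with torch.no_grad():                                  # stage 1 is kept frozen here
        _, z0 = vqae(motion)                               # clean quantized latents
    t = torch.randint(0, len(alphas_cumprod), (z0.size(0),))
    noise = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, 1, 1)
    z_t = a.sqrt() * z0 + (1 - a).sqrt() * noise           # forward diffusion of latents
    return F.mse_loss(denoiser(z_t, audio_feat, t), noise)


if __name__ == "__main__":
    vqae, denoiser = MeshMotionVQAE(), LatentDiffusionTransformer()
    motion = torch.randn(2, 30, 5023 * 3)                  # 2 clips, 30 frames of offsets
    audio = torch.randn(2, 30, 768)                        # frame-aligned audio features
    betas = torch.linspace(1e-4, 0.02, 1000)
    print(diffusion_step(vqae, denoiser, motion, audio, torch.cumprod(1 - betas, dim=0)).item())
```

In a typical setup of this kind, the stage-1 autoencoder is trained first with reconstruction and codebook/commitment losses (omitted above), then frozen while the stage-2 denoiser learns to reverse the noising process in its latent space; at inference, denoised latents are decoded back to per-frame vertex offsets.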
DOI: 10.48550/arxiv.2408.01826