PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation
Main authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
Summary: | Audio-driven talking face generation is a challenging task in digital
communication. Despite significant progress in the area, most existing methods
concentrate on audio-lip synchronization, often overlooking aspects such as
visual quality, customization, and generalization that are crucial to producing
realistic talking faces. To address these limitations, we introduce a novel,
customizable one-shot audio-driven talking face generation framework, named
PortraitTalk. Our proposed method utilizes a latent diffusion framework
consisting of two main components: IdentityNet and AnimateNet. IdentityNet is
designed to preserve identity features consistently across the generated video
frames, while AnimateNet aims to enhance temporal coherence and motion
consistency. This framework also integrates an audio input with the reference
images, thereby reducing the reliance on reference-style videos prevalent in
existing approaches. A key innovation of PortraitTalk is the incorporation of
text prompts through decoupled cross-attention mechanisms, which significantly
expands creative control over the generated videos. Through extensive
experiments, including a newly developed evaluation metric, our model
demonstrates superior performance over state-of-the-art methods, setting a
new standard for the generation of customizable, realistic talking faces
suitable for real-world applications. |
---|---|
DOI: | 10.48550/arxiv.2412.07754 |