MoEE: Mixture of Emotion Experts for Audio-Driven Portrait Animation
Format: Article
Language: English
Online access: Order full text
Abstract: The generation of talking avatars has achieved significant advancements in
precise audio synchronization. However, crafting lifelike talking head videos
requires capturing a broad spectrum of emotions and subtle facial expressions.
Current methods face fundamental challenges: a) the absence of frameworks for
modeling single basic emotional expressions, which restricts the generation of
complex emotions such as compound emotions; b) the lack of comprehensive
datasets rich in human emotional expressions, which limits the potential of
models. To address these challenges, we propose the following innovations: 1)
the Mixture of Emotion Experts (MoEE) model, which decouples six fundamental
emotions to enable the precise synthesis of both singular and compound
emotional states; 2) the DH-FaceEmoVid-150 dataset, specifically curated to
include six prevalent human emotional expressions as well as four types of
compound emotions, thereby expanding the training potential of emotion-driven
models. Furthermore, to enhance the flexibility of emotion control, we propose
an emotion-to-latents module that leverages multimodal inputs, aligning diverse
control signals, such as audio, text, and labels, to support more varied control
inputs and to enable emotion control from audio alone. Through
extensive quantitative and qualitative evaluations, we demonstrate that the
MoEE framework, in conjunction with the DH-FaceEmoVid-150 dataset, excels in
generating complex emotional expressions and nuanced facial details, setting a
new benchmark in the field. These datasets will be publicly released.
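The abstract gives no implementation details, but as a rough illustration of the mixture-of-emotion-experts idea it describes, the sketch below shows how six per-emotion expert networks could be blended through soft gating so that compound emotions emerge as weighted combinations of the basic ones. All names, dimensions, and the gating design here are assumptions for illustration, not the authors' code.

```python
# Illustrative sketch (assumed design, not the paper's implementation):
# a soft mixture over six basic-emotion experts.
import torch
import torch.nn as nn

BASIC_EMOTIONS = ["happy", "sad", "angry", "fearful", "disgusted", "surprised"]

class EmotionExpertMixture(nn.Module):
    """Gated mixture over six basic-emotion experts (illustrative only)."""

    def __init__(self, feat_dim: int = 512, cond_dim: int = 128):
        super().__init__()
        # One small expert network per basic emotion (architecture assumed).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_dim, feat_dim),
                nn.GELU(),
                nn.Linear(feat_dim, feat_dim),
            )
            for _ in BASIC_EMOTIONS
        ])
        # Gating network: maps an emotion latent (e.g. one produced from audio,
        # text, or a label by an emotion-to-latents module) to expert weights.
        self.gate = nn.Linear(cond_dim, len(BASIC_EMOTIONS))

    def forward(self, face_feat: torch.Tensor, emo_latent: torch.Tensor) -> torch.Tensor:
        # face_feat:  (B, feat_dim) facial motion/appearance features
        # emo_latent: (B, cond_dim) emotion condition vector
        weights = torch.softmax(self.gate(emo_latent), dim=-1)                 # (B, 6)
        expert_out = torch.stack([e(face_feat) for e in self.experts], dim=1)  # (B, 6, feat_dim)
        # Non-one-hot gate weights blend experts, yielding compound emotions.
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)                 # (B, feat_dim)


if __name__ == "__main__":
    moe = EmotionExpertMixture()
    out = moe(torch.randn(2, 512), torch.randn(2, 128))
    print(out.shape)  # torch.Size([2, 512])
```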
DOI: 10.48550/arxiv.2501.01808