DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation
Main Authors: | , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Conversation is an essential component of virtual avatar activities in the
metaverse. With the development of natural language processing, textual and
vocal conversation generation has achieved a significant breakthrough. However,
face-to-face conversations account for the vast majority of daily
conversations, while most existing methods focus on single-person talking
head generation. In this work, we take a step further and consider generating
realistic face-to-face conversation videos. Conversation generation is more
challenging than single-person talking head generation, since it not only
requires generating photo-realistic individual talking heads but also requires
the listener to respond to the speaker. In this paper, we propose a novel
unified framework based on neural radiance field (NeRF) to address this task.
Specifically, we model both the speaker and listener with a NeRF framework,
with different conditions to control individual expressions. The speaker is
driven by the audio signal, while the response of the listener depends on both
visual and acoustic information. In this way, face-to-face conversation videos
are generated between human avatars, with all the interlocutors modeled within
the same network. Moreover, to facilitate future research on this task, we
collect a new human conversation dataset containing 34 video clips.
Quantitative and qualitative experiments evaluate our method in different
aspects, e.g., image quality, pose sequence trend, and naturalness of the
rendered videos. Experimental results demonstrate that the avatars in the
resulting videos are able to hold a realistic conversation and maintain
individual styles. All the code, data, and models will be made publicly
available. |
---|---|
DOI: | 10.48550/arxiv.2203.07931 |
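
The summary describes one network shared by both interlocutors, with the speaker conditioned on audio and the listener conditioned on audio plus visual information. The following is a minimal sketch of that conditioning idea only, not the authors' implementation. Everything named here is hypothetical: the feature sizes, the zero-padding of the speaker's unused visual slot, the tiny randomly initialised MLP, and the names `speaker_condition`, `listener_condition`, and `TinyConditionalField`.

```python
import math
import random

AUDIO_DIM = 4    # hypothetical size of an audio feature vector
VISUAL_DIM = 3   # hypothetical size of a visual feature describing the speaker
COND_DIM = AUDIO_DIM + VISUAL_DIM
NUM_FREQS = 2    # number of frequencies in the positional encoding


def speaker_condition(audio):
    # Assumption: the speaker uses audio only, so the visual slot is zero-padded
    # to keep one input layout for both roles.
    return list(audio) + [0.0] * VISUAL_DIM


def listener_condition(audio, visual):
    # The listener reacts to both what is heard and what is seen.
    return list(audio) + list(visual)


def positional_encoding(point):
    # Standard NeRF-style sin/cos encoding of a 3D point.
    encoded = []
    for value in point:
        for k in range(NUM_FREQS):
            encoded.append(math.sin((2 ** k) * math.pi * value))
            encoded.append(math.cos((2 ** k) * math.pi * value))
    return encoded


class TinyConditionalField:
    """One set of weights maps (point, condition) to (rgb, density)."""

    def __init__(self, hidden=8, seed=0):
        rng = random.Random(seed)
        in_dim = 3 * 2 * NUM_FREQS + COND_DIM
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)]
                   for _ in range(4)]

    def __call__(self, point, cond):
        x = positional_encoding(point) + list(cond)
        # Hidden layer with ReLU activation.
        h = [max(0.0, sum(w * v for w, v in zip(row, x))) for row in self.w1]
        raw = [sum(w * v for w, v in zip(row, h)) for row in self.w2]
        rgb = [1.0 / (1.0 + math.exp(-r)) for r in raw[:3]]  # sigmoid colour
        density = max(0.0, raw[3])                            # non-negative density
        return rgb, density


field = TinyConditionalField()
audio = [0.1, -0.2, 0.3, 0.0]
visual = [0.5, 0.1, -0.4]
point = [0.0, 0.25, -0.5]

speaker_rgb, speaker_density = field(point, speaker_condition(audio))
listener_rgb, listener_density = field(point, listener_condition(audio, visual))
```

The same `field` object is queried for both roles, so only the condition vector distinguishes speaker from listener. A real system would train the weights and render images by integrating colour and density along camera rays.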