DialogueNeRF: Towards Realistic Avatar Face-to-Face Conversation Video Generation

Conversation is an essential component of virtual avatar activities in the metaverse. With the development of natural language processing, textual and vocal conversation generation has achieved a significant breakthrough. However, face-to-face conversations account for the vast majority of daily conversations, while most existing methods have focused on single-person talking-head generation. In this work, we take a step further and consider generating realistic face-to-face conversation videos. Conversation generation is more challenging than single-person talking-head generation, since it not only requires generating photo-realistic individual talking heads but also requires the listener to respond to the speaker. In this paper, we propose a novel unified framework based on neural radiance fields (NeRF) to address this task. Specifically, we model both the speaker and the listener with a NeRF framework, using different conditions to control individual expressions. The speaker is driven by the audio signal, while the response of the listener depends on both visual and acoustic information. In this way, face-to-face conversation videos are generated between human avatars, with all interlocutors modeled within the same network. Moreover, to facilitate future research on this task, we collect a new human conversation dataset containing 34 video clips. Quantitative and qualitative experiments evaluate our method in different aspects, e.g., image quality, pose-sequence trend, and naturalness of the rendered videos. Experimental results demonstrate that the avatars in the resulting videos are able to perform a realistic conversation and maintain individual styles. All code, data, and models will be made publicly available.
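
To make the conditioning idea concrete, the sketch below shows one way such a conditioned radiance field could look. It is a minimal sketch under assumed dimensions, not the authors' released implementation; ConditionalNeRF, speaker_cond, and listener_cond are hypothetical names, and the real system's audio/visual encoders and fusion strategy are not specified in the abstract.

# Minimal illustrative sketch (PyTorch) of the conditioning scheme described
# above: a single radiance field renders both interlocutors, and a per-frame
# condition code controls expression. All network sizes, feature dimensions,
# and fusion layers here are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class ConditionalNeRF(nn.Module):
    def __init__(self, pos_dim=63, dir_dim=27, cond_dim=64, hidden=256):
        super().__init__()
        # Trunk maps (positionally encoded xyz, condition code) to a feature.
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        # Color additionally depends on the (encoded) viewing direction.
        self.color_head = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, pos_enc, dir_enc, cond):
        h = self.trunk(torch.cat([pos_enc, cond], dim=-1))
        sigma = torch.relu(self.density_head(h))          # volume density
        rgb = self.color_head(torch.cat([h, dir_enc], dim=-1))
        return rgb, sigma

# Hypothetical condition encoders: speaker frames are driven by audio alone,
# listener frames by audio plus a visual summary of the speaker.
speaker_cond = nn.Linear(29, 64)         # e.g. a DeepSpeech-style audio window
listener_cond = nn.Linear(29 + 128, 64)  # audio features + visual features

# Example: query one sample point for a speaker frame.
xyz_enc = torch.randn(1, 63)             # positionally encoded 3D point
view_enc = torch.randn(1, 27)            # positionally encoded view direction
audio_feat = torch.randn(1, 29)
rgb, sigma = ConditionalNeRF()(xyz_enc, view_enc, speaker_cond(audio_feat))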

Bibliographic Details
Main Authors: Yan, Yichao; Zhou, Zanwei; Wang, Zi; Gao, Jingnan; Yang, Xiaokang
Format: Article
Language: English
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: Full text at https://arxiv.org/abs/2203.07931
DOI: 10.48550/arxiv.2203.07931
Date: 2022-03-15
Source: arXiv.org