SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis

Achieving high synchronization in the synthesis of realistic, speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GAN) struggle to maintain consistent facial identity, while Neural Radiance Fields (NeRF) methods, although they can address this issue, often produce mismatched lip movements, inadequate facial expressions, and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity, lip movements, facial expressions, and head poses. The absence of these synchronizations is a fundamental flaw, leading to unrealistic and artificial outcomes. To address the critical issue of synchronization, identified as the "devil" in creating realistic talking heads, we introduce SyncTalk. This NeRF-based method effectively maintains subject identity, enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our Head-Sync Stabilizer optimizes head poses, achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk
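The abstract mentions capturing facial expressions with a 3D facial blendshape model. As a rough illustration of the general blendshape idea only (this is not SyncTalk's implementation; the function name, offsets, and weights below are all hypothetical), an expression can be modeled as a neutral face mesh plus a weighted sum of expression basis offsets:

```python
def blend_expression(neutral, bases, weights):
    """Return per-vertex positions: neutral + sum_i (w_i * basis_i).

    neutral: flat list of vertex coordinates for the neutral face
    bases:   list of offset lists, one per expression basis shape
    weights: one scalar weight per basis shape
    """
    assert len(bases) == len(weights)
    out = list(neutral)
    for basis, w in zip(bases, weights):
        out = [v + w * d for v, d in zip(out, basis)]
    return out


# Toy example with three coordinates and two hypothetical basis shapes.
neutral = [0.0, 0.0, 0.0]
smile = [0.1, 0.2, 0.0]    # hypothetical "smile" offset
jaw_open = [0.0, -0.3, 0.1]  # hypothetical "jaw open" offset

face = blend_expression(neutral, [smile, jaw_open], [0.5, 1.0])
```

In a real system the coefficients would be regressed from the input (here, speech and video frames) rather than set by hand, and the mesh would have thousands of vertices.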

Detailed Description

Bibliographic Details
Main Authors: Peng, Ziqiao; Hu, Wentao; Shi, Yue; Zhu, Xiangyu; Zhang, Xiaomei; Zhao, Hao; He, Jun; Liu, Hongyan; Fan, Zhaoxin
Format: Article
Language: eng
Online Access: Order full text
creator Peng, Ziqiao; Hu, Wentao; Shi, Yue; Zhu, Xiangyu; Zhang, Xiaomei; Zhao, Hao; He, Jun; Liu, Hongyan; Fan, Zhaoxin
doi 10.48550/arxiv.2311.17590
format Article
language eng
source arXiv.org
subjects Computer Science - Computer Vision and Pattern Recognition
title SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis