MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation
Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
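To make the two mechanisms named in the abstract concrete, here is a minimal, hedged PyTorch sketch (not the authors' code): a linear-attention layer that carries a running memory state summarizing past frames, and an emotion-adaptive layer norm whose scale and shift are predicted from an audio-derived emotion embedding. All module names, tensor shapes, and the positive feature map are illustrative assumptions.

```python
# Hedged sketch only: illustrative implementations of "memory-guided linear attention"
# and "emotion adaptive layer norm" as described in the abstract, under assumed shapes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionAdaLayerNorm(nn.Module):
    """Layer norm whose scale/shift come from an emotion embedding (assumed interface)."""

    def __init__(self, dim: int, emotion_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(emotion_dim, 2 * dim)

    def forward(self, x: torch.Tensor, emotion_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); emotion_emb: (batch, emotion_dim)
        scale, shift = self.to_scale_shift(emotion_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)


class MemoryLinearAttention(nn.Module):
    """Linear attention with a memory state that accumulates past key-value summaries."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x, memory_kv=None, memory_k=None):
        # x: (batch, tokens, dim) for the current clip of frames
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k = F.elu(q) + 1, F.elu(k) + 1            # positive feature maps for linear attention
        kv = torch.einsum("btd,bte->bde", k, v)       # current-clip key-value summary
        k_sum = k.sum(dim=1)                          # current-clip normalizer
        if memory_kv is not None:                     # fold in the stored past context
            kv = kv + memory_kv
            k_sum = k_sum + memory_k
        out = torch.einsum("btd,bde->bte", q, kv) / (
            torch.einsum("btd,bd->bt", q, k_sum).unsqueeze(-1) + 1e-6
        )
        return self.out(out), (kv.detach(), k_sum.detach())  # output and updated memory state


# Usage sketch: process clips sequentially, carrying the memory state forward.
if __name__ == "__main__":
    frames = torch.randn(1, 16, 128)                  # (batch, frame tokens, dim)
    emotion = torch.randn(1, 32)                      # emotion embedding detected from audio
    attn, ada_ln = MemoryLinearAttention(128), EmotionAdaLayerNorm(128, 32)
    out, memory = attn(frames)                        # first clip: no past memory yet
    out, memory = attn(frames, *memory)               # next clip: guided by stored past context
    out = ada_ln(out, emotion)                        # refine features with the emotion signal
```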
Saved in:
Published in: | arXiv.org, 2024-12 |
---|---|
Main authors: | Zheng, Longtao; Zhang, Yifan; Guo, Hanzhong; Pan, Jiachun; Tan, Zhenxiong; Lu, Jiahao; Tang, Chuanxin; An, Bo; Yan, Shuicheng |
Format: | Article |
Language: | English |
Subjects: | Animation; Attention; Audio data; Emotions; Image enhancement; Image quality; Modules; Smoothness; Synchronism; Talking; Video |
Online access: | Full text |
creator | Zheng, Longtao; Zhang, Yifan; Guo, Hanzhong; Pan, Jiachun; Tan, Zhenxiong; Lu, Jiahao; Tang, Chuanxin; An, Bo; Yan, Shuicheng |
description | Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, achieving seamless audio-lip synchronization, maintaining long-term identity consistency, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing memory states to store information from a longer past context to guide temporal modeling via linear attention; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment. |
format | Article |
fulltext | fulltext |
identifier | EISSN: 2331-8422 |
ispartof | arXiv.org, 2024-12 |
issn | 2331-8422 |
language | eng |
recordid | cdi_proquest_journals_3141682516 |
source | Open Access: Freely Accessible Journals by multiple vendors |
subjects | Animation; Attention; Audio data; Emotions; Image enhancement; Image quality; Modules; Smoothness; Synchronism; Talking; Video |
title | MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-17T17%3A48%3A12IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest&rft_val_fmt=info:ofi/fmt:kev:mtx:book&rft.genre=document&rft.atitle=MEMO:%20Memory-Guided%20Diffusion%20for%20Expressive%20Talking%20Video%20Generation&rft.jtitle=arXiv.org&rft.au=Zheng,%20Longtao&rft.date=2024-12-05&rft.eissn=2331-8422&rft_id=info:doi/&rft_dat=%3Cproquest%3E3141682516%3C/proquest%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=3141682516&rft_id=info:pmid/&rfr_iscdi=true |