An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
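
The abstract describes a common pipeline: extract acoustic and visual features, fuse them, and train a network to predict a target such as a time-frequency mask. The following minimal PyTorch sketch illustrates that pipeline in outline only; it is not a method from the paper, and the layer sizes, the concatenation-based fusion, and the magnitude-mask training target are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualMaskNet(nn.Module):
    """Minimal audio-visual fusion sketch: project per-frame acoustic and
    visual embeddings, concatenate them, and predict a time-frequency mask."""

    def __init__(self, n_freq_bins=257, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq_bins, hidden_dim)   # acoustic features
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)   # e.g. lip embeddings
        self.fusion = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Linear(hidden_dim, n_freq_bins)    # training target: mask

    def forward(self, noisy_spec, visual_feats):
        # noisy_spec:   (batch, frames, n_freq_bins) magnitude spectrogram
        # visual_feats: (batch, frames, visual_dim) per-frame visual embeddings,
        #               assumed already upsampled to the audio frame rate
        a = self.audio_proj(noisy_spec)
        v = self.visual_proj(visual_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))      # fusion by concatenation
        mask = torch.sigmoid(self.mask_head(fused))            # values in [0, 1]
        return mask * noisy_spec                               # enhanced magnitude estimate
```

In practice, such a model would be trained by comparing the masked magnitude against the clean-speech magnitude (for instance with a mean squared error loss); the choice of training target and objective function is itself one of the axes the survey covers. On the evaluation side, one objective measure commonly reported in the speech separation literature is the scale-invariant signal-to-distortion ratio (SI-SDR). A plain NumPy rendering of the standard definition (Le Roux et al., 2019) is sketched below; the function and argument names are ours.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and reference signal."""
    # Zero-mean both signals so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimal scaling.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # target component of the estimate
    noise = estimate - projection        # everything else counts as distortion
    return 10 * np.log10((np.dot(projection, projection) + eps)
                         / (np.dot(noise, noise) + eps))
```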

Bibliographic Details

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, Vol. 29, pp. 1368-1396
Authors: Michelsanti, Daniel; Tan, Zheng-Hua; Zhang, Shi-Xiong; Xu, Yong; Yu, Meng; Yu, Dong; Jensen, Jesper
Format: Article
Language: English
DOI: 10.1109/TASLP.2021.3066303
ISSN: 2329-9290
EISSN: 2329-9304
Online access: Order full text
Subjects:
Acoustics
Audio-visual processing
Deep learning
Feature extraction
Machine learning
Microphones
Separation
Signal processing
sound source separation
Sound sources
Speech
Speech enhancement
Speech processing
speech separation
speech synthesis
Task analysis
Videos
Visual aspects
Visual signals
Visualization