An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
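
The abstract describes a common pipeline: extract acoustic and visual features, fuse them, and train a network to predict a target such as a time-frequency mask. The following minimal PyTorch sketch illustrates that pipeline in outline only; it is not a method from the paper, and the layer sizes, the concatenation-based fusion, and the magnitude-mask training target are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualMaskNet(nn.Module):
    """Minimal audio-visual fusion sketch: project per-frame acoustic and
    visual embeddings, concatenate them, and predict a time-frequency mask."""

    def __init__(self, n_freq_bins=257, visual_dim=512, hidden_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq_bins, hidden_dim)   # acoustic features
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)   # e.g. lip embeddings
        self.fusion = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        self.mask_head = nn.Linear(hidden_dim, n_freq_bins)    # training target: mask

    def forward(self, noisy_spec, visual_feats):
        # noisy_spec:   (batch, frames, n_freq_bins) magnitude spectrogram
        # visual_feats: (batch, frames, visual_dim) per-frame visual embeddings,
        #               assumed already upsampled to the audio frame rate
        a = self.audio_proj(noisy_spec)
        v = self.visual_proj(visual_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))      # fusion by concatenation
        mask = torch.sigmoid(self.mask_head(fused))            # values in [0, 1]
        return mask * noisy_spec                               # enhanced magnitude estimate
```

In practice, such a model would be trained by comparing the masked magnitude against the clean-speech magnitude (for instance with a mean squared error loss); the choice of training target and objective function is itself one of the axes the survey covers. On the evaluation side, one objective measure commonly reported in the speech separation literature is the scale-invariant signal-to-distortion ratio (SI-SDR). A plain NumPy rendering of the standard definition (Le Roux et al., 2019) is sketched below; the function and argument names are ours.

```python
import numpy as np

def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between a 1-D estimate and reference signal."""
    # Zero-mean both signals so the measure ignores DC offsets.
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to find the optimal scaling.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target          # target component of the estimate
    noise = estimate - projection        # everything else counts as distortion
    return 10 * np.log10((np.dot(projection, projection) + eps)
                         / (np.dot(noise, noise) + eps))
```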

Bibliographic Details

Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, Vol. 29, pp. 1368-1396
Authors: Michelsanti, Daniel; Tan, Zheng-Hua; Zhang, Shi-Xiong; Xu, Yong; Yu, Meng; Yu, Dong; Jensen, Jesper
Format: Article
Language: English
DOI: 10.1109/TASLP.2021.3066303
ISSN: 2329-9290
EISSN: 2329-9304
Online access: Order full text
Subjects:
Acoustics
Audio-visual processing
Deep learning
Feature extraction
Machine learning
Microphones
Separation
Signal processing
sound source separation
Sound sources
Speech
Speech enhancement
Speech processing
speech separation
speech synthesis
Task analysis
Videos
Visual aspects
Visual signals
Visualization