An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
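The abstract describes systems that encode the noisy audio and the speaker's visual stream separately, fuse the two embeddings with a deep network, and predict a training target such as a time-frequency mask. The sketch below is a minimal illustration of that general pattern, not a reproduction of any specific system from the survey; the layer choices, feature dimensions, and the `AudioVisualMaskEstimator` name are all hypothetical assumptions.

```python
# Minimal illustrative sketch (assumed architecture, not from the survey):
# encode audio and visual streams separately, concatenate them, model temporal
# context, and predict a time-frequency mask applied to the noisy spectrogram.
import torch
import torch.nn as nn


class AudioVisualMaskEstimator(nn.Module):
    def __init__(self, n_freq_bins: int = 257, visual_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        # Acoustic stream: per-frame encoding of the noisy magnitude spectrogram.
        self.audio_encoder = nn.Sequential(nn.Linear(n_freq_bins, hidden_dim), nn.ReLU())
        # Visual stream: per-frame encoding of precomputed lip/face embeddings.
        self.visual_encoder = nn.Sequential(nn.Linear(visual_dim, hidden_dim), nn.ReLU())
        # Fusion: concatenate both streams and model temporal context with an LSTM.
        self.fusion = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True)
        # Training target: a ratio-mask-style output in [0, 1] per time-frequency bin.
        self.mask_head = nn.Sequential(nn.Linear(hidden_dim, n_freq_bins), nn.Sigmoid())

    def forward(self, noisy_spec: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # noisy_spec: (batch, frames, n_freq_bins); visual_feats: (batch, frames, visual_dim),
        # assumed to be synchronised to the same frame rate.
        a = self.audio_encoder(noisy_spec)
        v = self.visual_encoder(visual_feats)
        fused, _ = self.fusion(torch.cat([a, v], dim=-1))
        mask = self.mask_head(fused)
        return mask * noisy_spec  # enhanced magnitude estimate


# Usage with random tensors standing in for real features.
model = AudioVisualMaskEstimator()
noisy = torch.rand(2, 100, 257)   # 2 utterances, 100 frames, 257 frequency bins
lips = torch.rand(2, 100, 512)    # matching visual embeddings
enhanced = model(noisy, lips)     # -> (2, 100, 257)
```

An objective function such as the mean squared error between `enhanced` and the clean magnitude spectrogram would complete the training loop; the survey itself catalogues many alternative targets and losses.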
Saved in:
Published in: | IEEE/ACM transactions on audio, speech, and language processing, 2021, Vol.29, p.1368-1396 |
---|---|
Main Authors: | Michelsanti, Daniel; Tan, Zheng-Hua; Zhang, Shi-Xiong; Xu, Yong; Yu, Meng; Yu, Dong; Jensen, Jesper |
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
container_end_page | 1396 |
---|---|
container_issue | |
container_start_page | 1368 |
container_title | IEEE/ACM transactions on audio, speech, and language processing |
container_volume | 29 |
creator | Michelsanti, Daniel; Tan, Zheng-Hua; Zhang, Shi-Xiong; Xu, Yong; Yu, Meng; Yu, Dong; Jensen, Jesper |
description | Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance. |
doi_str_mv | 10.1109/TASLP.2021.3066303 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 2329-9290 |
ispartof | IEEE/ACM transactions on audio, speech, and language processing, 2021, Vol.29, p.1368-1396 |
issn | 2329-9290; 2329-9304 |
language | eng |
recordid | cdi_ieee_primary_9380418 |
source | IEEE Electronic Library (IEL) |
subjects | Acoustics; Audio-visual processing; Deep learning; Feature extraction; Machine learning; Microphones; Separation; Signal processing; sound source separation; Sound sources; Speech; Speech enhancement; Speech processing; speech separation; speech synthesis; Task analysis; Videos; Visual aspects; Visual signals; Visualization |
title | An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-31T18%3A46%3A47IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=An%20Overview%20of%20Deep-Learning-Based%20Audio-Visual%20Speech%20Enhancement%20and%20Separation&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Michelsanti,%20Daniel&rft.date=2021&rft.volume=29&rft.spage=1368&rft.epage=1396&rft.pages=1368-1396&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2021.3066303&rft_dat=%3Cproquest_RIE%3E2515854101%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2515854101&rft_id=info:pmid/&rft_ieee_id=9380418&rfr_iscdi=true |