Past review, current progress, and challenges ahead on the cocktail party problem

The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the...

Bibliographic Details
Published in:	Frontiers of information technology & electronic engineering 2018, Vol.19 (1), p.40-63
Main Authors:	Qian, Yan-min; Weng, Chao; Chang, Xuan-kai; Wang, Shuai; Yu, Dong
Format:	Article
Language:	English
Subjects:	Automatic speech recognition; Beamforming; Clustering; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Computers; Deep learning; Electrical Engineering; Electronics and Microelectronics; Instrumentation; Machine learning; Musical instruments; Networks; Performance evaluation; Permutations; Review; Scene analysis; Signal processing; Sound; Speech; Voice recognition
Online Access:	Full text

container_end_page	63
container_issue	1
container_start_page	40
container_title	Frontiers of information technology & electronic engineering
container_volume	19
creator	Qian, Yan-min; Weng, Chao; Chang, Xuan-kai; Wang, Shuai; Yu, Dong
description	The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the last two decades in attacking this problem. We focus our discussions on the speech separation problem given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, the conventional multi-channel techniques such as beamforming and multi-channel blind source separation, and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.
doi_str_mv	10.1631/FITEE.1700814
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_proquest_journals_2918725035</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><sourcerecordid>2918725035</sourcerecordid><originalsourceid>FETCH-LOGICAL-c357t-c6803dad2fa2810af1ed897ba03c0968581072103844240d5f35833fb64dec6e3</originalsourceid><addsrcrecordid>eNptkM1LAzEQxYMoWGqP3gNeu-sk2exmj1JaLRRUqOeQZmf74Xa3JqnS_97UVrx4mmH4vTe8R8gtg5Tlgt1PpvPxOGUFgGLZBelxKGVScgGXvztT2TUZeL8BAJazsihVj7y-GB-ow881fg2p3TuHbaA71y0dej-kpq2oXZmmwXaJnpoVmop2LQ0rpLaz78GsG7ozLhyOokWD2xtyVZvG4-A8--RtMp6PnpLZ8-N09DBLrJBFSGyuQFSm4rXhioGpGVaqLBYGhIUyVzIeC85AqCzjGVSyFlIJUS_yrEKbo-iTu5Nv_PuxRx_0ptu7Nr7UPGYtuAQhI5WcKOs67x3WeufWW-MOmoE-Fqd_itPn4iKfnngfuRjZ_bn-L_gGajJt_g</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2918725035</pqid></control><display><type>article</type><title>Past review, current progress, and challenges ahead on the cocktail party problem</title><source>SpringerLink Journals</source><source>Alma/SFX Local Collection</source><source>ProQuest Central</source><creator>Qian, Yan-min ; Weng, Chao ; Chang, Xuan-kai ; Wang, Shuai ; Yu, Dong</creator><creatorcontrib>Qian, Yan-min ; Weng, Chao ; Chang, Xuan-kai ; Wang, Shuai ; Yu, Dong</creatorcontrib><description>The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the last two decades in attacking this problem. 
We focus our discussions on the speech separation problem given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, the conventional multi-channel techniques such as beamforming and multi-channel blind source separation, and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.</description><identifier>ISSN: 2095-9184</identifier><identifier>EISSN: 2095-9230</identifier><identifier>DOI: 10.1631/FITEE.1700814</identifier><language>eng</language><publisher>Hangzhou: Zhejiang University Press</publisher><subject>Automatic speech recognition ; Beamforming ; Clustering ; Communications Engineering ; Computer Hardware ; Computer Science ; Computer Systems Organization and Communication Networks ; Computers ; Deep learning ; Electrical Engineering ; Electronics and Microelectronics ; Instrumentation ; Machine learning ; Musical instruments ; Networks ; Performance evaluation ; Permutations ; Review ; Scene analysis ; Signal processing ; Sound ; Speech ; Voice recognition</subject><ispartof>Frontiers of information technology & electronic engineering, 2018, Vol.19 (1), p.40-63</ispartof><rights>Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 2018</rights><rights>Zhejiang University and Springer-Verlag GmbH Germany, part of Springer Nature 
2018.</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c357t-c6803dad2fa2810af1ed897ba03c0968581072103844240d5f35833fb64dec6e3</citedby><cites>FETCH-LOGICAL-c357t-c6803dad2fa2810af1ed897ba03c0968581072103844240d5f35833fb64dec6e3</cites><orcidid>0000-0002-0314-3790</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktopdf>$$Uhttps://link.springer.com/content/pdf/10.1631/FITEE.1700814$$EPDF$$P50$$Gspringer$$H</linktopdf><linktohtml>$$Uhttps://www.proquest.com/docview/2918725035?pq-origsite=primo$$EHTML$$P50$$Gproquest$$H</linktohtml><link.rule.ids>314,776,780,21368,27903,27904,33723,41467,42536,43784,51297</link.rule.ids></links><search><creatorcontrib>Qian, Yan-min</creatorcontrib><creatorcontrib>Weng, Chao</creatorcontrib><creatorcontrib>Chang, Xuan-kai</creatorcontrib><creatorcontrib>Wang, Shuai</creatorcontrib><creatorcontrib>Yu, Dong</creatorcontrib><title>Past review, current progress, and challenges ahead on the cocktail party problem</title><title>Frontiers of information technology & electronic engineering</title><addtitle>Frontiers Inf Technol Electronic Eng</addtitle><description>The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the last two decades in attacking this problem. 
We focus our discussions on the speech separation problem given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, the conventional multi-channel techniques such as beamforming and multi-channel blind source separation, and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.</description><subject>Automatic speech recognition</subject><subject>Beamforming</subject><subject>Clustering</subject><subject>Communications Engineering</subject><subject>Computer Hardware</subject><subject>Computer Science</subject><subject>Computer Systems Organization and Communication Networks</subject><subject>Computers</subject><subject>Deep learning</subject><subject>Electrical Engineering</subject><subject>Electronics and Microelectronics</subject><subject>Instrumentation</subject><subject>Machine learning</subject><subject>Musical instruments</subject><subject>Networks</subject><subject>Performance evaluation</subject><subject>Permutations</subject><subject>Review</subject><subject>Scene analysis</subject><subject>Signal processing</subject><subject>Sound</subject><subject>Speech</subject><subject>Voice 
recognition</subject><issn>2095-9184</issn><issn>2095-9230</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2018</creationdate><recordtype>article</recordtype><sourceid>BENPR</sourceid><recordid>eNptkM1LAzEQxYMoWGqP3gNeu-sk2exmj1JaLRRUqOeQZmf74Xa3JqnS_97UVrx4mmH4vTe8R8gtg5Tlgt1PpvPxOGUFgGLZBelxKGVScgGXvztT2TUZeL8BAJazsihVj7y-GB-ow881fg2p3TuHbaA71y0dej-kpq2oXZmmwXaJnpoVmop2LQ0rpLaz78GsG7ozLhyOokWD2xtyVZvG4-A8--RtMp6PnpLZ8-N09DBLrJBFSGyuQFSm4rXhioGpGVaqLBYGhIUyVzIeC85AqCzjGVSyFlIJUS_yrEKbo-iTu5Nv_PuxRx_0ptu7Nr7UPGYtuAQhI5WcKOs67x3WeufWW-MOmoE-Fqd_itPn4iKfnngfuRjZ_bn-L_gGajJt_g</recordid><startdate>2018</startdate><enddate>2018</enddate><creator>Qian, Yan-min</creator><creator>Weng, Chao</creator><creator>Chang, Xuan-kai</creator><creator>Wang, Shuai</creator><creator>Yu, Dong</creator><general>Zhejiang University Press</general><general>Springer Nature B.V</general><scope>AAYXX</scope><scope>CITATION</scope><scope>8FE</scope><scope>8FG</scope><scope>ABJCF</scope><scope>AFKRA</scope><scope>ARAPS</scope><scope>AZQEC</scope><scope>BENPR</scope><scope>BGLVJ</scope><scope>CCPQU</scope><scope>DWQXO</scope><scope>GNUQQ</scope><scope>HCIFZ</scope><scope>JQ2</scope><scope>K7-</scope><scope>L6V</scope><scope>M7S</scope><scope>P5Z</scope><scope>P62</scope><scope>PQEST</scope><scope>PQQKQ</scope><scope>PQUKI</scope><scope>PTHSS</scope><orcidid>https://orcid.org/0000-0002-0314-3790</orcidid></search><sort><creationdate>2018</creationdate><title>Past review, current progress, and challenges ahead on the cocktail party problem</title><author>Qian, Yan-min ; Weng, Chao ; Chang, Xuan-kai ; Wang, Shuai ; Yu, Dong</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c357t-c6803dad2fa2810af1ed897ba03c0968581072103844240d5f35833fb64dec6e3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2018</creationdate><topic>Automatic speech 
recognition</topic><topic>Beamforming</topic><topic>Clustering</topic><topic>Communications Engineering</topic><topic>Computer Hardware</topic><topic>Computer Science</topic><topic>Computer Systems Organization and Communication Networks</topic><topic>Computers</topic><topic>Deep learning</topic><topic>Electrical Engineering</topic><topic>Electronics and Microelectronics</topic><topic>Instrumentation</topic><topic>Machine learning</topic><topic>Musical instruments</topic><topic>Networks</topic><topic>Performance evaluation</topic><topic>Permutations</topic><topic>Review</topic><topic>Scene analysis</topic><topic>Signal processing</topic><topic>Sound</topic><topic>Speech</topic><topic>Voice recognition</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Qian, Yan-min</creatorcontrib><creatorcontrib>Weng, Chao</creatorcontrib><creatorcontrib>Chang, Xuan-kai</creatorcontrib><creatorcontrib>Wang, Shuai</creatorcontrib><creatorcontrib>Yu, Dong</creatorcontrib><collection>CrossRef</collection><collection>ProQuest SciTech Collection</collection><collection>ProQuest Technology Collection</collection><collection>Materials Science & Engineering Collection</collection><collection>ProQuest Central UK/Ireland</collection><collection>Advanced Technologies & Aerospace Collection</collection><collection>ProQuest Central Essentials</collection><collection>ProQuest Central</collection><collection>Technology Collection</collection><collection>ProQuest One Community College</collection><collection>ProQuest Central Korea</collection><collection>ProQuest Central Student</collection><collection>SciTech Premium Collection</collection><collection>ProQuest Computer Science Collection</collection><collection>Computer Science Database</collection><collection>ProQuest Engineering Collection</collection><collection>Engineering Database</collection><collection>Advanced Technologies & Aerospace Database</collection><collection>ProQuest Advanced 
Technologies & Aerospace Collection</collection><collection>ProQuest One Academic Eastern Edition (DO NOT USE)</collection><collection>ProQuest One Academic</collection><collection>ProQuest One Academic UKI Edition</collection><collection>Engineering Collection</collection><jtitle>Frontiers of information technology & electronic engineering</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Qian, Yan-min</au><au>Weng, Chao</au><au>Chang, Xuan-kai</au><au>Wang, Shuai</au><au>Yu, Dong</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Past review, current progress, and challenges ahead on the cocktail party problem</atitle><jtitle>Frontiers of information technology & electronic engineering</jtitle><stitle>Frontiers Inf Technol Electronic Eng</stitle><date>2018</date><risdate>2018</risdate><volume>19</volume><issue>1</issue><spage>40</spage><epage>63</epage><pages>40-63</pages><issn>2095-9184</issn><eissn>2095-9230</eissn><abstract>The cocktail party problem, i.e., tracing and recognizing the speech of a specific speaker when multiple speakers talk simultaneously, is one of the critical problems yet to be solved to enable the wide application of automatic speech recognition (ASR) systems. In this overview paper, we review the techniques proposed in the last two decades in attacking this problem. We focus our discussions on the speech separation problem given its central role in the cocktail party environment, and describe the conventional single-channel techniques such as computational auditory scene analysis (CASA), non-negative matrix factorization (NMF) and generative models, the conventional multi-channel techniques such as beamforming and multi-channel blind source separation, and the newly developed deep learning-based techniques, such as deep clustering (DPCL), the deep attractor network (DANet), and permutation invariant training (PIT). 
We also present techniques developed to improve ASR accuracy and speaker identification in the cocktail party environment. We argue that effectively exploiting information in the microphone array, the acoustic training set, and the language itself, using more powerful models and better optimization objectives and techniques, will be the approach to solving the cocktail party problem.</abstract><cop>Hangzhou</cop><pub>Zhejiang University Press</pub><doi>10.1631/FITEE.1700814</doi><tpages>24</tpages><orcidid>https://orcid.org/0000-0002-0314-3790</orcidid></addata></record>
fulltext	fulltext
identifier	ISSN: 2095-9184
ispartof	Frontiers of information technology & electronic engineering, 2018, Vol.19 (1), p.40-63
issn	2095-9184; 2095-9230
language	eng
recordid	cdi_proquest_journals_2918725035
source	SpringerLink Journals; Alma/SFX Local Collection; ProQuest Central
subjects	Automatic speech recognition; Beamforming; Clustering; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Computers; Deep learning; Electrical Engineering; Electronics and Microelectronics; Instrumentation; Machine learning; Musical instruments; Networks; Performance evaluation; Permutations; Review; Scene analysis; Signal processing; Sound; Speech; Voice recognition
title	Past review, current progress, and challenges ahead on the cocktail party problem
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-27T06%3A38%3A28IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Past%20review,%20current%20progress,%20and%20challenges%20ahead%20on%20the%20cocktail%20party%20problem&rft.jtitle=Frontiers%20of%20information%20technology%20&%20electronic%20engineering&rft.au=Qian,%20Yan-min&rft.date=2018&rft.volume=19&rft.issue=1&rft.spage=40&rft.epage=63&rft.pages=40-63&rft.issn=2095-9184&rft.eissn=2095-9230&rft_id=info:doi/10.1631/FITEE.1700814&rft_dat=%3Cproquest_cross%3E2918725035%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2918725035&rft_id=info:pmid/&rfr_iscdi=true