Deep Learning-Based Non-Intrusive Multi-Objective Speech Assessment Model With Cross-Domain Features

This study proposes a cross-domain multi-objective speech assessment model, called MOSA-Net, which can simultaneously estimate the speech quality, intelligibility, and distortion assessment scores of an input speech signal. MOSA-Net comprises a convolutional neural network and bidirectional long short-term memory architecture for representation extraction, and a multiplicative attention layer and a fully connected layer for each assessment metric prediction. Additionally, cross-domain features (spectral and time-domain features) and latent representations from self-supervised learned (SSL) models are used as inputs to combine rich acoustic information to obtain more accurate assessments. Experimental results show that in both seen and unseen noise environments, MOSA-Net can improve the linear correlation coefficient (LCC) scores in perceptual evaluation of speech quality (PESQ) prediction, compared to Quality-Net, an existing single-task model for PESQ prediction, and improve LCC scores in short-time objective intelligibility (STOI) prediction, compared to STOI-Net, an existing single-task model for STOI prediction. Moreover, MOSA-Net can be used as a pre-trained model to be effectively adapted to an assessment model for predicting subjective quality and intelligibility scores with a limited amount of training data. Experimental results show that MOSA-Net can improve LCC scores in mean opinion score (MOS) predictions, compared to MOS-SSL, a strong single-task model for MOS prediction. We further adopt the latent representations of MOSA-Net to guide the speech enhancement (SE) process and derive a quality-intelligibility (QI)-aware SE (QIA-SE) approach. Experimental results show that QIA-SE outperforms a baseline SE system, with improved PESQ scores in both seen and unseen noise environments.
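The comparisons in the abstract are all reported as linear correlation coefficient (LCC) scores between predicted and ground-truth metric values, i.e. Pearson's r. As a minimal illustrative sketch (the score values below are invented for demonstration and do not come from the paper):

```python
import math

def lcc(x, y):
    """Linear correlation coefficient (Pearson's r) between two score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical ground-truth and predicted PESQ scores for five utterances
true_pesq = [1.2, 2.5, 3.1, 3.8, 4.2]
pred_pesq = [1.4, 2.3, 3.0, 4.0, 4.1]

print(lcc(true_pesq, pred_pesq))  # close to 1.0 means accurate prediction
```

An LCC near 1.0 indicates the assessment model ranks and scales utterances consistently with the reference metric, which is how MOSA-Net is compared against Quality-Net, STOI-Net, and MOS-SSL.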

Detailed Description

Saved in:
Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023, Vol. 31, p. 54-70
Main Authors: Zezario, Ryandhimas E., Fu, Szu-Wei, Chen, Fei, Fuh, Chiou-Shann, Wang, Hsin-Min, Tsao, Yu
Format: Article
Language: English
Subjects:
Online Access: Full text
DOI: 10.1109/TASLP.2022.3205757
ISSN: 2329-9290
EISSN: 2329-9304
Source: ACM Digital Library Complete; IEEE Electronic Library (IEL)
Subjects:
Acoustic distortion
Acoustics
Adaptation models
Artificial neural networks
Computer architecture
Correlation coefficients
Deep learning
Intelligibility
Measurement
multi-objective learning
non-intrusive speech assessment models
Predictive models
Psychoacoustic models
Quality assessment
Representations
Speech
Speech enhancement
Speech processing
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-28T16%3A46%3A19IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Deep%20Learning-Based%20Non-Intrusive%20Multi-Objective%20Speech%20Assessment%20Model%20With%20Cross-Domain%20Features&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Zezario,%20Ryandhimas%20E.&rft.date=2023&rft.volume=31&rft.spage=54&rft.epage=70&rft.pages=54-70&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2022.3205757&rft_dat=%3Cproquest_cross%3E2747609114%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2747609114&rft_id=info:pmid/&rft_ieee_id=9905733&rfr_iscdi=true