Audio-Visual Speech Enhancement Using Self-supervised Learning to Improve Speech Intelligibility in Cochlear Implant Simulations

Individuals with hearing impairments face challenges in comprehending speech, particularly in noisy environments. The aim of this study is to explore the effectiveness of audio-visual speech enhancement (AVSE) in improving the intelligibility of vocoded speech in cochlear implant (CI) simulations. Notably, the study focuses on a challenging scenario in which training data for the AVSE task are limited. To address this problem, we propose a novel deep neural network framework termed Self-Supervised Learning-based AVSE (SSL-AVSE). The proposed SSL-AVSE combines visual cues, such as lip and mouth movements, from the target speakers with the corresponding audio signals. The contextually combined audio and visual data are then fed into a Transformer-based SSL AV-HuBERT model to extract features, which are further processed by a BLSTM-based SE model. The results demonstrate several key findings. First, SSL-AVSE successfully overcomes the issue of limited data by leveraging the AV-HuBERT model. Second, fine-tuning the AV-HuBERT model parameters for the target SE task yields significant performance improvements: PESQ (Perceptual Evaluation of Speech Quality) rises from 1.43 to 1.67 and STOI (Short-Time Objective Intelligibility) from 0.70 to 0.74. Furthermore, the performance of SSL-AVSE was evaluated on CI vocoded speech to assess intelligibility for CI users. Comparative experimental outcomes reveal that, in the presence of the dynamic noises encountered in human conversations, SSL-AVSE exhibits a substantial improvement: the NCM (normalized covariance measure) values indicate an increase of 26.5% to 87.2% compared to the noisy baseline.
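The pipeline described above (fused audio-visual input, a Transformer-based SSL feature extractor, then a BLSTM speech-enhancement model) can be illustrated with a minimal PyTorch sketch. This is only an illustration of the described idea: the stand-in encoder, layer sizes, and spectral-masking objective are assumptions for readability, not the authors' exact SSL-AVSE configuration, which fine-tunes a pretrained AV-HuBERT model.

```python
import torch
import torch.nn as nn

class BLSTMEnhancer(nn.Module):
    """BLSTM speech-enhancement head operating on SSL features (illustrative sizes)."""
    def __init__(self, feat_dim=768, hidden=256, n_freq=257):
        super().__init__()
        # Two-layer bidirectional LSTM over the frame-level SSL feature sequence.
        self.blstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Per-frame sigmoid mask applied to the noisy magnitude spectrogram.
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())

    def forward(self, ssl_feats, noisy_mag):
        # ssl_feats: (batch, frames, feat_dim); noisy_mag: (batch, frames, n_freq)
        h, _ = self.blstm(ssl_feats)
        return self.mask(h) * noisy_mag   # enhanced magnitude spectrogram

# Stand-in for the pretrained AV-HuBERT encoder (fine-tuned for SE in the paper):
# here just a linear projection from fused audio-visual frame features to 768 dims.
ssl_encoder = nn.Linear(104, 768)
av_frames = torch.randn(1, 100, 104)    # dummy fused audio-visual input
noisy_mag = torch.rand(1, 100, 257)     # dummy noisy STFT magnitude
enhanced = BLSTMEnhancer()(ssl_encoder(av_frames), noisy_mag)
print(enhanced.shape)                   # torch.Size([1, 100, 257])
```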

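The reported PESQ and STOI scores can, in principle, be reproduced with the open-source `pesq` and `pystoi` packages. The file names below are placeholders, and the 16 kHz wide-band setting is an assumption rather than a detail taken from the paper.

```python
import soundfile as sf
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

# Placeholder file names: a clean reference and the corresponding enhanced output.
clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

# PESQ supports 8 kHz ("nb") or 16 kHz ("wb") input; 16 kHz wide-band is assumed here.
print("PESQ:", pesq(fs, clean, enhanced, "wb"))
# Standard (non-extended) STOI on the same signal pair.
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```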
Bibliographic Details
Published in: arXiv.org, 2023-07
Main authors: Richard Lee Lai, Jen-Cheng Hou, Mandar Gogate, Kia Dashtipour, Amir Hussain, Yu Tsao
Format: Article
Language: English
Subjects: Artificial neural networks; Audio data; Audio signals; Audio visual equipment; Cochlear implants; Correlation analysis; Intelligibility; Machine learning; Self-supervised learning; Speech; Speech processing; Visual signals
Identifier: EISSN 2331-8422
Online access: Full text