MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for Voice Activity Detection

Voice activity detection (VAD) distinguishes speech from non-speech, and its performance is of crucial importance for speech-based services. Recently, deep neural network (DNN)-based VADs have achieved better performance than conventional signal-processing methods. Existing DNN-based models typically rely on a handcrafted, fixed context window of speech to improve VAD performance. However, a fixed context window cannot cope with varied, unpredictable noise environments or highlight the speech information most critical to the VAD task. To address this problem, this paper proposes an adaptive multiple receptive-field attention neural network, called MLNET, for VAD. MLNET uses multiple branches to extract contextual speech information at several receptive fields and an attention block to weight the parts of the context most crucial to the final classification. Experiments in real-world scenarios demonstrate that the proposed MLNET-based model outperforms other baselines.
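Note: this record does not include the authors' code. As a rough, hypothetical illustration of the idea summarized above (several convolution branches with different receptive fields whose outputs are weighted per frame by an attention block before speech/non-speech classification), a minimal PyTorch sketch might look as follows. All layer sizes, kernel widths, and the attention formulation are assumptions, not the paper's actual MLNET architecture.

# Illustrative sketch only (not the authors' implementation): a multi-branch,
# multiple receptive-field model with per-frame branch attention for VAD.
import torch
import torch.nn as nn

class MultiReceptiveFieldVAD(nn.Module):
    def __init__(self, n_features=40, n_channels=32, kernel_sizes=(3, 5, 9)):
        super().__init__()
        # Each branch covers a different temporal context width via its kernel size.
        self.branches = nn.ModuleList(
            [nn.Conv1d(n_features, n_channels, k, padding=k // 2) for k in kernel_sizes]
        )
        # Attention block: one score per branch and frame, so the model can
        # emphasise whichever context width is most useful at that moment.
        self.attention = nn.Conv1d(n_channels * len(kernel_sizes), len(kernel_sizes), 1)
        # Frame-level speech / non-speech classifier.
        self.classifier = nn.Conv1d(n_channels, 1, 1)

    def forward(self, x):
        # x: (batch, n_features, time), e.g. log-mel or MFCC frames.
        feats = [torch.relu(branch(x)) for branch in self.branches]   # each (B, C, T)
        stacked = torch.stack(feats, dim=1)                           # (B, branches, C, T)
        scores = self.attention(torch.cat(feats, dim=1))              # (B, branches, T)
        weights = torch.softmax(scores, dim=1).unsqueeze(2)           # (B, branches, 1, T)
        fused = (stacked * weights).sum(dim=1)                        # (B, C, T)
        return torch.sigmoid(self.classifier(fused)).squeeze(1)       # (B, T) speech probability

if __name__ == "__main__":
    model = MultiReceptiveFieldVAD()
    frames = torch.randn(2, 40, 100)   # two utterances, 100 feature frames each
    print(model(frames).shape)         # torch.Size([2, 100])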

Bibliographic Details
Main Authors: Zheng, Zhenpeng; Wang, Jianzong; Cheng, Ning; Luo, Jian; Xiao, Jing
Format: Article
Language: English
Subjects: Computer Science - Learning; Computer Science - Sound
Online Access: Order full text
creator Zheng, Zhenpeng; Wang, Jianzong; Cheng, Ning; Luo, Jian; Xiao, Jing
description Voice activity detection (VAD) distinguishes speech from non-speech, and its performance is of crucial importance for speech-based services. Recently, deep neural network (DNN)-based VADs have achieved better performance than conventional signal-processing methods. Existing DNN-based models typically rely on a handcrafted, fixed context window of speech to improve VAD performance. However, a fixed context window cannot cope with varied, unpredictable noise environments or highlight the speech information most critical to the VAD task. To address this problem, this paper proposes an adaptive multiple receptive-field attention neural network, called MLNET, for VAD. MLNET uses multiple branches to extract contextual speech information at several receptive fields and an attention block to weight the parts of the context most crucial to the final classification. Experiments in real-world scenarios demonstrate that the proposed MLNET-based model outperforms other baselines.
doi_str_mv 10.48550/arxiv.2008.05650
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2008.05650
language eng
recordid cdi_arxiv_primary_2008_05650
source arXiv.org
subjects Computer Science - Learning; Computer Science - Sound
title MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for Voice Activity Detection
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T19%3A49%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MLNET:%20An%20Adaptive%20Multiple%20Receptive-field%20Attention%20Neural%20Network%20for%20Voice%20Activity%20Detection&rft.au=Zheng,%20Zhenpeng&rft.date=2020-08-12&rft_id=info:doi/10.48550/arxiv.2008.05650&rft_dat=%3Carxiv_GOX%3E2008_05650%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true