MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for Voice Activity Detection
Voice activity detection (VAD) distinguishes speech from non-speech, and its performance is of crucial importance for speech-based services. Recently, deep neural network (DNN)-based VADs have achieved better performance than conventional signal-processing methods. Existing DNN-based m...
Saved in:
Main authors: | Zheng, Zhenpeng; Wang, Jianzong; Cheng, Ning; Luo, Jian; Xiao, Jing |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Zheng, Zhenpeng; Wang, Jianzong; Cheng, Ning; Luo, Jian; Xiao, Jing |
description | Voice activity detection (VAD) distinguishes speech from non-speech, and its performance is of crucial importance for speech-based services. Recently, deep neural network (DNN)-based VADs have achieved better performance than conventional signal-processing methods. Existing DNN-based models typically rely on a handcrafted, fixed-size window of contextual speech information to improve VAD performance. However, a fixed context window cannot cope with varied, unpredictable noise environments or highlight the speech information most critical to the VAD task. To solve this problem, this paper proposes an adaptive multiple receptive-field attention neural network, called MLNET, for the VAD task. MLNET leverages multiple branches to extract contextual speech information at several receptive fields and employs an effective attention block to weight the parts of the context most crucial to the final classification. Experiments in real-world scenarios demonstrate that the proposed MLNET-based model outperforms other baselines. |
doi_str_mv | 10.48550/arxiv.2008.05650 |
format | Article |
fullrecord | (Raw Primo/arXiv export record omitted; it duplicates the title, authors, abstract, DOI, and subjects listed in the other fields. Recoverable details: published 2020-08-12; open access under the arXiv nonexclusive-distrib/1.0 license; full text at https://arxiv.org/abs/2008.05650.) |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2008.05650 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2008_05650 |
source | arXiv.org |
subjects | Computer Science - Learning; Computer Science - Sound |
title | MLNET: An Adaptive Multiple Receptive-field Attention Neural Network for Voice Activity Detection |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-29T19%3A49%3A29IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=MLNET:%20An%20Adaptive%20Multiple%20Receptive-field%20Attention%20Neural%20Network%20for%20Voice%20Activity%20Detection&rft.au=Zheng,%20Zhenpeng&rft.date=2020-08-12&rft_id=info:doi/10.48550/arxiv.2008.05650&rft_dat=%3Carxiv_GOX%3E2008_05650%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |
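The abstract in this record describes MLNET only at a high level: multiple branches extract contextual speech information at different receptive fields, and an attention block weights the most relevant context before frame-level speech/non-speech classification. The record does not give the actual layer configuration, so the following is a minimal illustrative sketch of that multi-branch attention idea in PyTorch; the feature choice (40-dimensional log-mel frames), kernel sizes, channel counts, and class names are all hypothetical assumptions, not the authors' published architecture.

```python
# Illustrative sketch only: the record's abstract does not specify MLNET's exact
# layers, so every size, kernel width, and the feature choice here is an assumption.
import torch
import torch.nn as nn


class MultiReceptiveFieldAttention(nn.Module):
    """Multi-branch 1-D convolutions with different receptive fields over the
    time axis, followed by an attention block that weights each branch's
    context before frame-level speech/non-speech classification."""

    def __init__(self, feat_dim=40, channels=64, kernel_sizes=(3, 7, 15)):
        super().__init__()
        # One branch per receptive field; padding keeps the frame count unchanged.
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(feat_dim, channels, k, padding=k // 2),
                nn.ReLU(),
            )
            for k in kernel_sizes
        )
        # Attention scores one scalar weight per branch and per frame.
        self.attn = nn.Conv1d(channels, 1, kernel_size=1)
        self.classifier = nn.Linear(channels, 2)  # speech vs. non-speech logits

    def forward(self, x):
        # x: (batch, frames, feat_dim) acoustic features, e.g. log-mel filterbanks.
        x = x.transpose(1, 2)                                       # (B, feat_dim, T)
        outs = torch.stack([b(x) for b in self.branches], dim=1)    # (B, n_branch, C, T)
        scores = torch.stack(
            [self.attn(o) for o in outs.unbind(dim=1)], dim=1
        )                                                           # (B, n_branch, 1, T)
        weights = torch.softmax(scores, dim=1)    # per-frame weight over branches
        fused = (weights * outs).sum(dim=1)       # (B, C, T) attention-weighted context
        return self.classifier(fused.transpose(1, 2))               # (B, T, 2)


if __name__ == "__main__":
    model = MultiReceptiveFieldAttention()
    dummy = torch.randn(4, 100, 40)   # 4 utterances, 100 frames, 40-dim features
    print(model(dummy).shape)         # torch.Size([4, 100, 2])
```

In this sketch the softmax over branches lets each frame adaptively favour a shorter or longer context window, which is the role the abstract attributes to MLNET's attention block; the published model may fuse branches and compute attention differently.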