Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling

Environmental Sound Classification (ESC) is an important field in a broad range of applications, such as smart cities, audio surveillance, and health care. Recently, Convolutional Neural Networks (CNNs) have taken the lead from traditional approaches and have produced promising results. However, the...

Bibliographic Details
Published in:	IEEE access 2023, Vol.11, p.849-862
Main Authors:	Seresht, Hamed Riazati ; Mohammadi, Karim
Format:	Article
Language:	eng
Subjects:	Accuracy ; Artificial neural networks ; Background noise ; Classification ; Complexity ; Computational modeling ; Computer architecture ; Convolutional neural networks ; environmental sound classification ; Feature extraction ; global feature pooling ; low complexity ; Neural networks ; regional saliency ; Spectrogram ; Time-frequency analysis ; Training data
Online Access:	Full text

container_end_page	862
container_issue
container_start_page	849
container_title	IEEE access
container_volume	11
creator	Seresht, Hamed Riazati ; Mohammadi, Karim
description	Environmental Sound Classification (ESC) is an important field in a broad range of applications, such as smart cities, audio surveillance, and health care. Recently, Convolutional Neural Networks (CNNs) have taken the lead from traditional approaches and have produced promising results. However, the achieved improvements are often accompanied by increasing depth, complexity, and size of the network, which prevents their usage in many practical applications. In this work, our goal is to empower a small-size low-complexity CNN model to achieve superior performance. To this end, we concentrate on the importance of global pooling technique, which is less investigated in ESC. In most previous works, models utilize global average pooling layer which does not consider regional saliency, and thus weakens the salient time-frequency regions contributions to the classification, and also to the training of convolutional kernels. We propose a novel global pooling method, called Sparse Salient Region Pooling (SSRP), which computes the channel descriptors using a sparse subset of features, and guides the model to effectively learn from the more salient time-frequency regions. Experimental results demonstrate that the proposed model with only 700K parameters yields accuracies of 86.7% on ESC-50 and 94.8% on ESC-10, which are comparable to that of the state-of-the-art methods. Compared to the baseline model, our model achieves absolute improvement of 21.8% in accuracy on ESC-50, with 98% smaller model size. Our visual analyses show that SSRP intensifies the responses of low-energy regions such that they contribute even more than high-energy regions to the classification of specific sound classes.
doi_str_mv	10.1109/ACCESS.2022.3232807
format	Article
fullrecord	<record><control><sourceid>proquest_cross</sourceid><recordid>TN_cdi_crossref_primary_10_1109_ACCESS_2022_3232807</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>10002350</ieee_id><doaj_id>oai_doaj_org_article_62a387fb9f24421cb81faccc963d183e</doaj_id><sourcerecordid>2761371936</sourcerecordid><originalsourceid>FETCH-LOGICAL-c409t-1b795f083ab96a86111e2c67c918754346541601322c8db3f243aba5e80198eb3</originalsourceid><addsrcrecordid>eNpNkUFv1DAQhSMEElXpL4CDJc5ZPHbi2McqWqDSChABcbQcZ7J4ycbBTrrshd-Ot6lQfRlr5r1vpHlZ9hroBoCqd7d1vW2aDaOMbTjjTNLqWXbFQKicl1w8f_J_md3EeKDpydQqq6vs73a8d8GPRxxnM5DGL2NH6sHE6Hpnzez8SH64-SfZ-VNe--M04B83n0ntx3s_LJd5sn3CJTyU-eTDL7I9Tv6EATvSnkkzmRCRNGZwaQf5ivsL84v3gxv3r7IXvRki3jzW6-z7--23-mO--_zhrr7d5bagas6hrVTZU8lNq4SRAgCQWVFZBbIqC16IsgBBgTNmZdfynhVJakqUFJTEll9ndyu38-agp-COJpy1N04_NHzYaxNmZwfUghkuq75VCVIwsK2E3lhrleAdSI6J9XZlTcH_XjDO-uCXkM4QNasE8AoUF0nFV5UNPsaA_f-tQPUlN73mpi-56cfckuvN6nKI-MRBKeMl5f8AxS-U5g</addsrcrecordid><sourcetype>Open Website</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2761371936</pqid></control><display><type>article</type><title>Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling</title><source>IEEE Open Access Journals</source><source>DOAJ Directory of Open Access Journals</source><source>EZB-FREE-00999 freely available EZB journals</source><creator>Seresht, Hamed Riazati ; Mohammadi, Karim</creator><creatorcontrib>Seresht, Hamed Riazati ; Mohammadi, Karim</creatorcontrib><description>Environmental Sound Classification (ESC) is an important field in a broad range of applications, such as smart cities, audio surveillance, and health care. Recently, Convolutional Neural Networks (CNNs) have taken the lead from traditional approaches and have produced promising results. 
However, the achieved improvements are often accompanied by increasing depth, complexity, and size of the network, which prevents their usage in many practical applications. In this work, our goal is to empower a small-size low-complexity CNN model to achieve superior performance. To this end, we concentrate on the importance of global pooling technique, which is less investigated in ESC. In most previous works, models utilize global average pooling layer which does not consider regional saliency, and thus weakens the salient time-frequency regions contributions to the classification, and also to the training of convolutional kernels. We propose a novel global pooling method, called Sparse Salient Region Pooling (SSRP), which computes the channel descriptors using a sparse subset of features, and guides the model to effectively learn from the more salient time-frequency regions. Experimental results demonstrate that the proposed model with only 700K parameters yields accuracies of 86.7% on ESC-50 and 94.8% on ESC-10, which are comparable to that of the state-of-the-art methods. Compared to the baseline model, our model achieves absolute improvement of 21.8% in accuracy on ESC-50, with 98% smaller model size. 
Our visual analyses show that SSRP intensifies the responses of low-energy regions such that they contribute even more than high-energy regions to the classification of specific sound classes.</description><identifier>ISSN: 2169-3536</identifier><identifier>EISSN: 2169-3536</identifier><identifier>DOI: 10.1109/ACCESS.2022.3232807</identifier><identifier>CODEN: IAECCG</identifier><language>eng</language><publisher>Piscataway: IEEE</publisher><subject>Accuracy ; Artificial neural networks ; Background noise ; Classification ; Complexity ; Computational modeling ; Computer architecture ; Convolutional neural networks ; environmental sound classification ; Feature extraction ; global feature pooling ; low complexity ; Neural networks ; regional saliency ; Spectrogram ; Time-frequency analysis ; Training data</subject><ispartof>IEEE access, 2023, Vol.11, p.849-862</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2023</rights><lds50>peer_reviewed</lds50><oa>free_for_read</oa><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c409t-1b795f083ab96a86111e2c67c918754346541601322c8db3f243aba5e80198eb3</citedby><cites>FETCH-LOGICAL-c409t-1b795f083ab96a86111e2c67c918754346541601322c8db3f243aba5e80198eb3</cites><orcidid>0000-0002-3849-3061</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/10002350$$EHTML$$P50$$Gieee$$Hfree_for_read</linktohtml><link.rule.ids>315,782,786,866,2104,4026,27640,27930,27931,27932,54940</link.rule.ids></links><search><creatorcontrib>Seresht, Hamed Riazati</creatorcontrib><creatorcontrib>Mohammadi, Karim</creatorcontrib><title>Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling</title><title>IEEE 
access</title><addtitle>Access</addtitle><description>Environmental Sound Classification (ESC) is an important field in a broad range of applications, such as smart cities, audio surveillance, and health care. Recently, Convolutional Neural Networks (CNNs) have taken the lead from traditional approaches and have produced promising results. However, the achieved improvements are often accompanied by increasing depth, complexity, and size of the network, which prevents their usage in many practical applications. In this work, our goal is to empower a small-size low-complexity CNN model to achieve superior performance. To this end, we concentrate on the importance of global pooling technique, which is less investigated in ESC. In most previous works, models utilize global average pooling layer which does not consider regional saliency, and thus weakens the salient time-frequency regions contributions to the classification, and also to the training of convolutional kernels. We propose a novel global pooling method, called Sparse Salient Region Pooling (SSRP), which computes the channel descriptors using a sparse subset of features, and guides the model to effectively learn from the more salient time-frequency regions. Experimental results demonstrate that the proposed model with only 700K parameters yields accuracies of 86.7% on ESC-50 and 94.8% on ESC-10, which are comparable to that of the state-of-the-art methods. Compared to the baseline model, our model achieves absolute improvement of 21.8% in accuracy on ESC-50, with 98% smaller model size. 
Our visual analyses show that SSRP intensifies the responses of low-energy regions such that they contribute even more than high-energy regions to the classification of specific sound classes.</description><subject>Accuracy</subject><subject>Artificial neural networks</subject><subject>Background noise</subject><subject>Classification</subject><subject>Complexity</subject><subject>Computational modeling</subject><subject>Computer architecture</subject><subject>Convolutional neural networks</subject><subject>environmental sound classification</subject><subject>Feature extraction</subject><subject>global feature pooling</subject><subject>low complexity</subject><subject>Neural networks</subject><subject>regional saliency</subject><subject>Spectrogram</subject><subject>Time-frequency analysis</subject><subject>Training data</subject><issn>2169-3536</issn><issn>2169-3536</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2023</creationdate><recordtype>article</recordtype><sourceid>ESBDL</sourceid><sourceid>RIE</sourceid><sourceid>DOA</sourceid><recordid>eNpNkUFv1DAQhSMEElXpL4CDJc5ZPHbi2McqWqDSChABcbQcZ7J4ycbBTrrshd-Ot6lQfRlr5r1vpHlZ9hroBoCqd7d1vW2aDaOMbTjjTNLqWXbFQKicl1w8f_J_md3EeKDpydQqq6vs73a8d8GPRxxnM5DGL2NH6sHE6Hpnzez8SH64-SfZ-VNe--M04B83n0ntx3s_LJd5sn3CJTyU-eTDL7I9Tv6EATvSnkkzmRCRNGZwaQf5ivsL84v3gxv3r7IXvRki3jzW6-z7--23-mO--_zhrr7d5bagas6hrVTZU8lNq4SRAgCQWVFZBbIqC16IsgBBgTNmZdfynhVJakqUFJTEll9ndyu38-agp-COJpy1N04_NHzYaxNmZwfUghkuq75VCVIwsK2E3lhrleAdSI6J9XZlTcH_XjDO-uCXkM4QNasE8AoUF0nFV5UNPsaA_f-tQPUlN73mpi-56cfckuvN6nKI-MRBKeMl5f8AxS-U5g</recordid><startdate>2023</startdate><enddate>2023</enddate><creator>Seresht, Hamed Riazati</creator><creator>Mohammadi, Karim</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>ESBDL</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SC</scope><scope>7SP</scope><scope>7SR</scope><scope>8BQ</scope><scope>8FD</scope><scope>JG9</scope><scope>JQ2</scope><scope>L7M</scope><scope>L~C</scope><scope>L~D</scope><scope>DOA</scope><orcidid>https://orcid.org/0000-0002-3849-3061</orcidid></search><sort><creationdate>2023</creationdate><title>Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling</title><author>Seresht, Hamed Riazati ; Mohammadi, Karim</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c409t-1b795f083ab96a86111e2c67c918754346541601322c8db3f243aba5e80198eb3</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2023</creationdate><topic>Accuracy</topic><topic>Artificial neural networks</topic><topic>Background noise</topic><topic>Classification</topic><topic>Complexity</topic><topic>Computational modeling</topic><topic>Computer architecture</topic><topic>Convolutional neural networks</topic><topic>environmental sound classification</topic><topic>Feature extraction</topic><topic>global feature pooling</topic><topic>low complexity</topic><topic>Neural networks</topic><topic>regional saliency</topic><topic>Spectrogram</topic><topic>Time-frequency analysis</topic><topic>Training data</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Seresht, Hamed Riazati</creatorcontrib><creatorcontrib>Mohammadi, Karim</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE Open Access Journals</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Computer and Information Systems 
Abstracts</collection><collection>Electronics & Communications Abstracts</collection><collection>Engineered Materials Abstracts</collection><collection>METADEX</collection><collection>Technology Research Database</collection><collection>Materials Research Database</collection><collection>ProQuest Computer Science Collection</collection><collection>Advanced Technologies Database with Aerospace</collection><collection>Computer and Information Systems Abstracts Academic</collection><collection>Computer and Information Systems Abstracts Professional</collection><collection>DOAJ Directory of Open Access Journals</collection><jtitle>IEEE access</jtitle></facets><delivery><delcategory>Remote Search Resource</delcategory><fulltext>fulltext</fulltext></delivery><addata><au>Seresht, Hamed Riazati</au><au>Mohammadi, Karim</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling</atitle><jtitle>IEEE access</jtitle><stitle>Access</stitle><date>2023</date><risdate>2023</risdate><volume>11</volume><spage>849</spage><epage>862</epage><pages>849-862</pages><issn>2169-3536</issn><eissn>2169-3536</eissn><coden>IAECCG</coden><abstract>Environmental Sound Classification (ESC) is an important field in a broad range of applications, such as smart cities, audio surveillance, and health care. Recently, Convolutional Neural Networks (CNNs) have taken the lead from traditional approaches and have produced promising results. However, the achieved improvements are often accompanied by increasing depth, complexity, and size of the network, which prevents their usage in many practical applications. In this work, our goal is to empower a small-size low-complexity CNN model to achieve superior performance. To this end, we concentrate on the importance of global pooling technique, which is less investigated in ESC. 
In most previous works, models utilize global average pooling layer which does not consider regional saliency, and thus weakens the salient time-frequency regions contributions to the classification, and also to the training of convolutional kernels. We propose a novel global pooling method, called Sparse Salient Region Pooling (SSRP), which computes the channel descriptors using a sparse subset of features, and guides the model to effectively learn from the more salient time-frequency regions. Experimental results demonstrate that the proposed model with only 700K parameters yields accuracies of 86.7% on ESC-50 and 94.8% on ESC-10, which are comparable to that of the state-of-the-art methods. Compared to the baseline model, our model achieves absolute improvement of 21.8% in accuracy on ESC-50, with 98% smaller model size. Our visual analyses show that SSRP intensifies the responses of low-energy regions such that they contribute even more than high-energy regions to the classification of specific sound classes.</abstract><cop>Piscataway</cop><pub>IEEE</pub><doi>10.1109/ACCESS.2022.3232807</doi><tpages>14</tpages><orcidid>https://orcid.org/0000-0002-3849-3061</orcidid><oa>free_for_read</oa></addata></record>
fulltext	fulltext
identifier	ISSN: 2169-3536
ispartof	IEEE access, 2023, Vol.11, p.849-862
issn	2169-3536 2169-3536
language	eng
recordid	cdi_crossref_primary_10_1109_ACCESS_2022_3232807
source	IEEE Open Access Journals; DOAJ Directory of Open Access Journals; EZB-FREE-00999 freely available EZB journals
subjects	Accuracy ; Artificial neural networks ; Background noise ; Classification ; Complexity ; Computational modeling ; Computer architecture ; Convolutional neural networks ; environmental sound classification ; Feature extraction ; global feature pooling ; low complexity ; Neural networks ; regional saliency ; Spectrogram ; Time-frequency analysis ; Training data
title	Environmental Sound Classification With Low-Complexity Convolutional Neural Network Empowered by Sparse Salient Region Pooling
url	https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2024-12-04T19%3A16%3A14IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Environmental%20Sound%20Classification%20With%20Low-Complexity%20Convolutional%20Neural%20Network%20Empowered%20by%20Sparse%20Salient%20Region%20Pooling&rft.jtitle=IEEE%20access&rft.au=Seresht,%20Hamed%20Riazati&rft.date=2023&rft.volume=11&rft.spage=849&rft.epage=862&rft.pages=849-862&rft.issn=2169-3536&rft.eissn=2169-3536&rft.coden=IAECCG&rft_id=info:doi/10.1109/ACCESS.2022.3232807&rft_dat=%3Cproquest_cross%3E2761371936%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2761371936&rft_id=info:pmid/&rft_ieee_id=10002350&rft_doaj_id=oai_doaj_org_article_62a387fb9f24421cb81faccc963d183e&rfr_iscdi=true