Convolutional Neural Network and Feature Transformation for Distant Speech Recognition


Detailed Description

Saved in:
Bibliographic Details
Published in: International journal of electrical and computer engineering (Malacca, Malacca), 2018-12, Vol.8 (6), p.5381
Main authors: Pardede, Hilman F., Yuliani, Asri R., Sustika, Rika
Format: Article
Language: English
Online access: Full text
Description: In many applications, speech recognition must operate in conditions where there is some distance between the speakers and the microphones. This is called distant speech recognition (DSR). In this condition, speech recognition must deal with reverberation. Nowadays, deep learning technologies are becoming the main technologies for speech recognition. A Deep Neural Network (DNN) in hybrid with a Hidden Markov Model (HMM) is the commonly used architecture. However, this system is still not robust against reverberation. Previous studies use Convolutional Neural Networks (CNN), a variation of neural networks, to improve the robustness of speech recognition against noise. CNN has a pooling property that is used to find local correlations between neighboring dimensions in the features. With this property, CNN can be used for feature learning that emphasizes the information in neighboring frames. In this study we use CNN to deal with reverberation. We also propose to apply feature transformation techniques, linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT), to mel-frequency cepstral coefficients (MFCC) before feeding them to the CNN. We argue that transforming the features could produce more discriminative features for the CNN, and hence improve the robustness of speech recognition against reverberation. Our evaluations on the Meeting Recorder Digits (MRD) subset of the Aurora-5 database confirm that the use of LDA and MLLT transformations improves the robustness of speech recognition: it achieves a 20% relative error reduction compared to a standard DNN-based speech recognition system using the same number of hidden layers.
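The front end the abstract describes applies an LDA projection to MFCC frames before they reach the CNN. The sketch below shows the LDA step only, on synthetic data; the frame counts, feature dimensions, and class labels are illustrative assumptions, not the authors' actual configuration (acoustic-model pipelines of this kind are typically built with a toolkit such as Kaldi, and MLLT is a further decorrelating transform estimated on top of LDA, omitted here).

```python
import numpy as np

def lda_transform(feats, labels, out_dim):
    """Fisher LDA: find directions that maximize between-class over
    within-class scatter (numpy-only, illustrative)."""
    classes = np.unique(labels)
    mean = feats.mean(axis=0)
    d = feats.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        fc = feats[labels == c]
        mc = fc.mean(axis=0)
        Sw += (fc - mc).T @ (fc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(fc) * (diff @ diff.T)
    # eigenvectors of pinv(Sw) @ Sb, keep the top out_dim directions
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-eigvals.real)
    return eigvecs.real[:, order[:out_dim]]

# toy stand-in for MFCC frames: 200 frames x 13 coefficients,
# with 3 phone-like classes shifting the feature means
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=200)
feats = rng.normal(size=(200, 13)) + labels[:, None] * 0.5

W = lda_transform(feats, labels, out_dim=6)
projected = feats @ W   # reduced, more discriminative features for the CNN
print(projected.shape)  # (200, 6)
```

In the paper's setting, the class labels would come from frame-level HMM state alignments, and the projected frames (optionally MLLT-rotated) would be stacked into time-frequency patches for the CNN's convolution and pooling layers.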
DOI: 10.11591/ijece.v8i6.pp5381-5388
Publisher: IAES Institute of Advanced Engineering and Science, Yogyakarta
ISSN: 2088-8708
EISSN: 2088-8708
Source: Elektronische Zeitschriftenbibliothek - Freely accessible e-journals
Subjects:
Artificial neural networks
Digits
Error reduction
Feature recognition
Linear transformations
Machine learning
Markov chains
Microphones
Neural networks
Robustness
Speech
Speech recognition
Voice recognition