SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

We present SpecAugment, a simple data augmentation method for speech recognition. SpecAugment is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients). The augmentation policy consists of warping the features, masking blocks of frequency channels, and masking blocks of time steps. We apply SpecAugment on Listen, Attend and Spell networks for end-to-end speech recognition tasks. We achieve state-of-the-art performance on the LibriSpeech 960h and Switchboard 300h tasks, outperforming all prior work. On LibriSpeech, we achieve 6.8% WER on test-other without the use of a language model, and 5.8% WER with shallow fusion with a language model. This compares to the previous state-of-the-art hybrid system of 7.5% WER. For Switchboard, we achieve 7.2%/14.6% on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, which compares to the previous state-of-the-art hybrid system at 8.3%/17.3% WER.
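The augmentation policy described in the abstract has three components: time warping, frequency masking, and time masking. Below is a minimal NumPy sketch of the two masking steps (time warping omitted), assuming the features arrive as a (time, frequency) array of log filter bank coefficients. The function name and default parameter values are illustrative rather than the authors' reference implementation; `F` and `T` stand in for the paper's mask-size parameters.

```python
import numpy as np

def spec_augment(features, num_freq_masks=2, F=27, num_time_masks=2, T=100, rng=None):
    """Frequency- and time-mask a spectrogram of shape (num_frames, num_channels).

    Illustrative sketch only: time warping is omitted, and masked regions are
    set to zero (appropriate when the features are mean-normalized).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = features.copy()
    num_frames, num_channels = x.shape

    # Frequency masking: zero out f consecutive channels, f ~ Uniform[0, F],
    # starting at a channel f0 chosen so the mask stays in range.
    for _ in range(num_freq_masks):
        f = int(rng.integers(0, F + 1))
        f0 = int(rng.integers(0, max(1, num_channels - f + 1)))
        x[:, f0:f0 + f] = 0.0

    # Time masking: zero out t consecutive frames, t ~ Uniform[0, T].
    for _ in range(num_time_masks):
        t = int(rng.integers(0, T + 1))
        t0 = int(rng.integers(0, max(1, num_frames - t + 1)))
        x[t0:t0 + t, :] = 0.0

    return x
```

Masking to zero assumes mean-normalized features; in training, the masks would be drawn fresh for each utterance before the features are fed to the network.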

Bibliographic Details
Published in: arXiv.org, 2019-12
Main authors: Park, Daniel S; Chan, William; Zhang, Yu; Chiu, Chung-Cheng; Zoph, Barret; Cubuk, Ekin D; Le, Quoc V
Format: Article
Language: English
Subjects: Automatic speech recognition; Computer Science - Computation and Language; Computer Science - Learning; Computer Science - Sound; Data augmentation; Filter banks; Hybrid systems; Masking; Neural networks; Statistics - Machine Learning; Switching theory; Voice recognition
Online access: Full text
DOI: 10.48550/arxiv.1904.08779
EISSN: 2331-8422