Applying wav2vec2 for Speech Recognition on Bengali Common Voices Dataset

Speech is inherently continuous, with discrete words, phonemes, and other units not clearly segmented, so speech recognition has been an active research problem for decades. In this work we fine-tune wav2vec 2.0 to recognize and transcribe Bengali speech, training it on the Bengali Common Voice Speech Dataset.

After training for 71 epochs on a training set of 36,919 mp3 files, we achieved a training loss of 0.3172 and a WER of 0.2524 on a validation set of 7,747 files. Using a 5-gram language model, the Levenshtein distance was 2.6446 on a test set of 7,747 files. The training and validation sets were then combined, shuffled, and split in an 85-15 ratio; training for 7 more epochs on this combined dataset improved the Levenshtein distance to 2.60753 on the test set. Our model was the best performing submission, achieving a Levenshtein distance of 6.234 on a hidden dataset, 1.1049 units lower than the other competing submissions.
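The fine-tuning pipeline summarized above can be approximated with the Hugging Face transformers and datasets libraries. The sketch below is illustrative rather than the authors' exact recipe: the pretrained checkpoint, the dataset identifier, the column names, and the hyperparameters are all assumptions not stated in the abstract.

```python
# A minimal fine-tuning sketch, not the authors' exact recipe. The pretrained
# checkpoint, the Hugging Face dataset id, the column names, and all
# hyperparameters are illustrative assumptions.
import json
import torch
from datasets import load_dataset, Audio
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

DATASET_ID = "mozilla-foundation/common_voice_9_0"  # assumed stand-in for the Bengali Common Voice Speech Dataset
PRETRAINED = "facebook/wav2vec2-xls-r-300m"         # assumed pretrained checkpoint

# Load the Bengali split and decode the mp3 files to 16 kHz waveforms.
train = load_dataset(DATASET_ID, "bn", split="train")
train = train.cast_column("audio", Audio(sampling_rate=16_000))

# Build a character-level CTC vocabulary from the training transcripts.
chars = sorted(set("".join(train["sentence"])) - {" "})
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)       # word delimiter stands in for the space character
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)   # doubles as the CTC blank token
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    PRETRAINED,
    vocab_size=len(vocab),
    pad_token_id=processor.tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the convolutional feature encoder frozen
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One illustrative training step on a single clip; a real run would batch the
# clips with a padding collator and loop over them for many epochs.
example = train[0]
inputs = processor(example["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer(example["sentence"], return_tensors="pt").input_ids

model.train()
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"CTC loss after one step: {loss.item():.4f}")
```

A full run would batch the 36,919 training clips with a padding collator and loop over them for the 71 epochs reported above.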

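For decoding with a 5-gram language model and scoring by Levenshtein distance, a common approach is a KenLM-backed CTC beam-search decoder such as pyctcdecode. The abstract does not name a decoding library, so the decoder choice, the LM file name, and the character-level distance below are assumptions.

```python
# Decoding with a 5-gram KenLM language model and scoring by Levenshtein
# distance. The decoder library, the LM file name, and the character-level
# distance are assumptions; `model` and `processor` come from the sketch above.
import torch
from pyctcdecode import build_ctcdecoder

# Labels in vocabulary-id order, with the word delimiter mapped back to a space
# so the word-level n-gram model can score word boundaries.
vocab_items = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [tok if tok != "|" else " " for tok, _ in vocab_items]
decoder = build_ctcdecoder(labels, kenlm_model_path="bn_5gram.arpa")  # hypothetical LM file

def transcribe(audio_array):
    """Beam-search decode one 16 kHz waveform with the LM-fused decoder."""
    inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0].cpu().numpy()
    return decoder.decode(logits).strip()

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

# Usage: average distance over a (hypothetical) list of test examples.
# distances = [levenshtein(transcribe(x["audio"]["array"]), x["sentence"]) for x in test_set]
# print(sum(distances) / len(distances))
```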

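The combine-shuffle-resplit step before the final 7 epochs can be expressed with datasets utilities. In the sketch below, `train` reuses the object from the fine-tuning sketch, `validation` is an assumed, analogously loaded split, and the seed is arbitrary.

```python
# Combine the training and validation sets, shuffle, and re-split 85-15
# before the final 7 epochs, as described in the abstract.
from datasets import concatenate_datasets

combined = concatenate_datasets([train, validation]).shuffle(seed=42)
resplit = combined.train_test_split(test_size=0.15, seed=42)
new_train, new_validation = resplit["train"], resplit["test"]
print(len(new_train), len(new_validation))
```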
Bibliographic Details
Main Authors: Shahgir, H. A. Z. Sameen; Sayeed, Khondker Salman; Zaman, Tanjeem Azwad
Format: Article
Language: English
Published: 2022-09-11
Subjects: Computer Science - Artificial Intelligence; Computer Science - Learning
DOI: 10.48550/arXiv.2209.06581
Rights: CC BY 4.0 (open access)
Online Access: Full text via arXiv.org