Applying wav2vec2 for Speech Recognition on Bengali Common Voices Dataset

Speech is inherently continuous, with discrete words, phonemes, and other units not clearly segmented, so speech recognition has been an active research problem for decades. In this work we fine-tune wav2vec 2.0 to recognize and transcribe Bengali speech, training it on the Bengali Common Voice Speech Dataset.

After training for 71 epochs on a training set of 36,919 mp3 files, we achieved a training loss of 0.3172 and a WER of 0.2524 on a validation set of 7,747 files. Using a 5-gram language model, the Levenshtein distance was 2.6446 on a test set of 7,747 files. The training and validation sets were then combined, shuffled, and split in an 85-15 ratio; training for 7 more epochs on this combined dataset improved the Levenshtein distance to 2.60753 on the test set. Our model was the best performing submission, achieving a Levenshtein distance of 6.234 on a hidden dataset, 1.1049 units lower than the other competing submissions.
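The fine-tuning pipeline summarized above can be approximated with the Hugging Face transformers and datasets libraries. The sketch below is illustrative rather than the authors' exact recipe: the pretrained checkpoint, the dataset identifier, the column names, and the hyperparameters are all assumptions not stated in the abstract.

```python
# A minimal fine-tuning sketch, not the authors' exact recipe. The pretrained
# checkpoint, the Hugging Face dataset id, the column names, and all
# hyperparameters are illustrative assumptions.
import json
import torch
from datasets import load_dataset, Audio
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

DATASET_ID = "mozilla-foundation/common_voice_9_0"  # assumed stand-in for the Bengali Common Voice Speech Dataset
PRETRAINED = "facebook/wav2vec2-xls-r-300m"         # assumed pretrained checkpoint

# Load the Bengali split and decode the mp3 files to 16 kHz waveforms.
train = load_dataset(DATASET_ID, "bn", split="train")
train = train.cast_column("audio", Audio(sampling_rate=16_000))

# Build a character-level CTC vocabulary from the training transcripts.
chars = sorted(set("".join(train["sentence"])) - {" "})
vocab = {c: i for i, c in enumerate(chars)}
vocab["|"] = len(vocab)       # word delimiter stands in for the space character
vocab["[UNK]"] = len(vocab)
vocab["[PAD]"] = len(vocab)   # doubles as the CTC blank token
with open("vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f, ensure_ascii=False)

tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    PRETRAINED,
    vocab_size=len(vocab),
    pad_token_id=processor.tokenizer.pad_token_id,
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # keep the convolutional feature encoder frozen
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One illustrative training step on a single clip; a real run would batch the
# clips with a padding collator and loop over them for many epochs.
example = train[0]
inputs = processor(example["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer(example["sentence"], return_tensors="pt").input_ids

model.train()
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"CTC loss after one step: {loss.item():.4f}")
```

A full run would batch the 36,919 training clips with a padding collator and loop over them for the 71 epochs reported above.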

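For decoding with a 5-gram language model and scoring by Levenshtein distance, a common approach is a KenLM-backed CTC beam-search decoder such as pyctcdecode. The abstract does not name a decoding library, so the decoder choice, the LM file name, and the character-level distance below are assumptions.

```python
# Decoding with a 5-gram KenLM language model and scoring by Levenshtein
# distance. The decoder library, the LM file name, and the character-level
# distance are assumptions; `model` and `processor` come from the sketch above.
import torch
from pyctcdecode import build_ctcdecoder

# Labels in vocabulary-id order, with the word delimiter mapped back to a space
# so the word-level n-gram model can score word boundaries.
vocab_items = sorted(processor.tokenizer.get_vocab().items(), key=lambda kv: kv[1])
labels = [tok if tok != "|" else " " for tok, _ in vocab_items]
decoder = build_ctcdecoder(labels, kenlm_model_path="bn_5gram.arpa")  # hypothetical LM file

def transcribe(audio_array):
    """Beam-search decode one 16 kHz waveform with the LM-fused decoder."""
    inputs = processor(audio_array, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits[0].cpu().numpy()
    return decoder.decode(logits).strip()

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

# Usage: average distance over a (hypothetical) list of test examples.
# distances = [levenshtein(transcribe(x["audio"]["array"]), x["sentence"]) for x in test_set]
# print(sum(distances) / len(distances))
```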

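The combine-shuffle-resplit step before the final 7 epochs can be expressed with datasets utilities. In the sketch below, `train` reuses the object from the fine-tuning sketch, `validation` is an assumed, analogously loaded split, and the seed is arbitrary.

```python
# Combine the training and validation sets, shuffle, and re-split 85-15
# before the final 7 epochs, as described in the abstract.
from datasets import concatenate_datasets

combined = concatenate_datasets([train, validation]).shuffle(seed=42)
resplit = combined.train_test_split(test_size=0.15, seed=42)
new_train, new_validation = resplit["train"], resplit["test"]
print(len(new_train), len(new_validation))
```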
Bibliographic Details
Main Authors: Shahgir, H. A. Z. Sameen; Sayeed, Khondker Salman; Zaman, Tanjeem Azwad
Format: Article
Language: English
Published: 2022-09-11
Subjects: Computer Science - Artificial Intelligence; Computer Science - Learning
DOI: 10.48550/arXiv.2209.06581
Rights: CC BY 4.0 (open access)
Online Access: Full text via arXiv.org