1SPU: 1-step Speech Processing Unit

Recent studies have made progress in refining end-to-end (E2E) speech recognition encoders by applying Connectionist Temporal Classification (CTC) loss to enhance named entity recognition within transcriptions. However, these methods have been constrained by their exclusive use of the ASCII character set, allowing only a limited array of semantic labels. We propose 1SPU, a 1-step Speech Processing Unit that can recognize speech events (e.g., speaker change) or NL events (intent, emotion) while also transcribing vocal content. It extends the E2E automatic speech recognition (ASR) system's vocabulary with a set of unused placeholder symbols, conceptually akin to the special tokens used in sequence modeling. These placeholders are then assigned to represent semantic events (in the form of tags) and are integrated into the transcription process as distinct tokens. We demonstrate notable improvements on the SLUE benchmark and yield results on par with those for the SLURP dataset. Additionally, we provide a visual analysis of the system's proficiency in accurately pinpointing meaningful tokens over time, illustrating the enhancement in transcription quality through the use of supplementary semantic tags.
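The mechanism the abstract describes can be sketched as follows: unused vocabulary slots are repurposed as semantic-tag tokens, so a single CTC decoding pass emits both transcript characters and event markers. This is a minimal, hypothetical illustration; the vocabulary, tag names, and collapse routine are assumptions for exposition, not the authors' implementation.

```python
# Hypothetical sketch of the 1SPU idea: extend an ASR output vocabulary with
# placeholder tokens repurposed as semantic tags, so one CTC decoding pass
# yields both the transcript and event markers. All names are illustrative.

BASE_VOCAB = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz '")
TAG_TOKENS = ["<spk_chg>", "<intent>", "<emotion>"]  # assumed tag set
VOCAB = BASE_VOCAB + TAG_TOKENS  # tags occupy indices 29, 30, 31

def ctc_collapse(frame_ids, vocab=VOCAB, blank=0):
    """Standard CTC collapse: merge consecutive repeats, drop blanks.
    Tag tokens pass through as distinct output symbols."""
    out, prev = [], None
    for i in frame_ids:
        if i != blank and i != prev:
            out.append(vocab[i])
        prev = i
    return out

# Frame-wise argmax output containing a speaker-change tag between words:
frames = [8, 8, 0, 9, 0, 29, 29, 0, 15, 11]
print(ctc_collapse(frames))  # ['h', 'i', '<spk_chg>', 'o', 'k']
```

Because the tags live in the same output vocabulary as ordinary characters, no second decoding stage or separate tagger is required, which appears to be the "1-step" property the title refers to.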

Bibliographic details

Published in: arXiv.org, 2023-12
Main authors: Karan Singla, Shahab Jalalvand, Yeon-Jun Kim, Antonio Moreno Daniel, Srinivas Bangalore, Andrej Ljolje, Ben Stern
Format: Article
Language: English
Online access: Full text
EISSN: 2331-8422
Source: Free E-Journals
Subjects: Automatic speech recognition; Semantics; Speech processing; Tags; Voice recognition