High-precision Voice Search Query Correction via Retrievable Speech-text Embedings
Main authors: | Li, Christopher; Wang, Gary; Kastner, Kyle; Su, Heng; Chen, Allen; Rosenberg, Andrew; Chen, Zhehuai; Wu, Zelin; Velikovich, Leonid; Rondon, Pat; Caseiro, Diamantino; Aleksic, Petar |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Computation and Language; Computer Science - Sound |
Online Access: | Order full text |
creator | Li, Christopher; Wang, Gary; Kastner, Kyle; Su, Heng; Chen, Allen; Rosenberg, Andrew; Chen, Zhehuai; Wu, Zelin; Velikovich, Leonid; Rondon, Pat; Caseiro, Diamantino; Aleksic, Petar |
description | Automatic speech recognition (ASR) systems can suffer from poor recall for
various reasons, such as noisy audio, lack of sufficient training data, etc.
Previous work has shown that recall can be improved by retrieving rewrite
candidates from a large database of likely, contextually-relevant alternatives
to the hypothesis text using nearest-neighbors search over embeddings of the
ASR hypothesis text to correct and candidate corrections.
However, ASR-hypothesis-based retrieval can yield poor precision if the
textual hypotheses are too phonetically dissimilar to the transcript truth. In
this paper, we eliminate the hypothesis-audio mismatch problem by querying the
correction database directly using embeddings derived from the utterance audio;
the embeddings of the utterance audio and candidate corrections are produced by
multimodal speech-text embedding networks trained to place the embedding of the
audio of an utterance and the embedding of its corresponding textual transcript
close together.
After locating an appropriate correction candidate using nearest-neighbor
search, we score the candidate with its speech-text embedding distance before
adding the candidate to the original n-best list.
We show a relative word error rate (WER) reduction of 6% on utterances whose
transcripts appear in the candidate set, without increasing WER on general
utterances. |
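The correction flow the abstract describes (embed the utterance audio, retrieve the nearest candidate correction from a database of text embeddings, gate it on speech-text embedding distance, then append it to the ASR n-best list) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the brute-force L2 search stands in for a real approximate nearest-neighbor index, and every function name, candidate string, and the distance threshold is hypothetical.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def retrieve_correction(audio_emb, cand_embs, cand_texts, max_dist=0.35):
    """Nearest candidate correction to the audio embedding, or None.

    audio_emb comes from the speech side of a joint speech-text embedding
    model; cand_embs are text-side embeddings of the correction database.
    The max_dist threshold is the precision guard: a candidate that is too
    far from the audio in embedding space is rejected outright.
    """
    audio_emb = l2_normalize(audio_emb)
    cand_embs = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    dists = np.linalg.norm(cand_embs - audio_emb, axis=1)  # exact NN search
    i = int(np.argmin(dists))
    if dists[i] > max_dist:
        return None
    return cand_texts[i], float(dists[i])

def augment_nbest(nbest, correction):
    """Append the retrieved correction to the original n-best list."""
    if correction is not None and correction[0] not in nbest:
        return nbest + [correction[0]]
    return nbest
```

In the paper's setting the retrieved hypothesis is scored by its speech-text embedding distance before being added, which the `max_dist` gate mimics here; a rejected retrieval leaves the original n-best list untouched, which is how precision is preserved on general utterances.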
doi_str_mv | 10.48550/arxiv.2401.04235 |
format | Article |
creationdate | 2024-01-08 |
rights | http://creativecommons.org/licenses/by/4.0 |
oa | free_for_read |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2401.04235 |
language | eng |
recordid | cdi_arxiv_primary_2401_04235 |
source | arXiv.org |
subjects | Computer Science - Computation and Language Computer Science - Sound |
title | High-precision Voice Search Query Correction via Retrievable Speech-text Embedings |