High Fidelity Speech Regeneration with Application to Speech Enhancement

Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time and which utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method obtains valuable improvements over recent baselines.

Full Description

Saved in:
Bibliographic Details
Main Authors: Polyak, Adam; Wolf, Lior; Adi, Yossi; Kabeli, Ori; Taigman, Yaniv
Format: Article
Language: English
Subjects:
Online Access: Order full text
creator Polyak, Adam
Wolf, Lior
Adi, Yossi
Kabeli, Ori
Taigman, Yaniv
description Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time and which utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method obtains valuable improvements over recent baselines.
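The abstract describes a pipeline in which the input waveform is reduced to a compact representation (ASR-derived content features plus a speaker-identity embedding from an auxiliary network), and a generator maps that representation back to a 24 kHz waveform. The sketch below illustrates only the data flow of such a design; every shape, function name, and the random linear "networks" are illustrative stand-ins, not the authors' actual model.

```python
import numpy as np

SAMPLE_RATE = 24_000  # target rate stated in the abstract
HOP = 300             # assumed hop size: 12.5 ms frames at 24 kHz

def content_features(wav: np.ndarray, dim: int = 256) -> np.ndarray:
    """Stand-in for an ASR encoder: one content vector per hop of audio."""
    n_frames = len(wav) // HOP
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((HOP, dim)) / np.sqrt(HOP)
    frames = wav[: n_frames * HOP].reshape(n_frames, HOP)
    return frames @ proj                      # (n_frames, dim)

def identity_embedding(wav: np.ndarray, dim: int = 64) -> np.ndarray:
    """Stand-in for the auxiliary identity network: one vector per utterance."""
    rng = np.random.default_rng(1)
    proj = rng.standard_normal((2, dim))
    stats = np.array([wav.mean(), wav.std()])  # crude utterance summary
    return np.tanh(stats @ proj)               # (dim,)

def regenerate(wav: np.ndarray) -> np.ndarray:
    """Wav-to-wav: combine content and identity, then decode to samples."""
    c = content_features(wav)                              # (T, 256)
    e = identity_embedding(wav)                            # (64,)
    z = np.concatenate([c, np.tile(e, (len(c), 1))], 1)    # (T, 320)
    rng = np.random.default_rng(2)
    dec = rng.standard_normal((z.shape[1], HOP)) / np.sqrt(z.shape[1])
    return (z @ dec).reshape(-1)               # regenerated 24 kHz samples
```

The key design point the abstract emphasizes is that the generator never sees the noisy waveform directly, only the compact content-plus-identity representation, which is why the output is "regenerated" rather than filtered.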
doi_str_mv 10.48550/arxiv.2102.00429
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2102.00429
ispartof
issn
language eng
recordid cdi_arxiv_primary_2102_00429
source arXiv.org
subjects Computer Science - Learning
Computer Science - Sound
title High Fidelity Speech Regeneration with Application to Speech Enhancement
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T13%3A53%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=High%20Fidelity%20Speech%20Regeneration%20with%20Application%20to%20Speech%20Enhancement&rft.au=Polyak,%20Adam&rft.date=2021-01-31&rft_id=info:doi/10.48550/arxiv.2102.00429&rft_dat=%3Carxiv_GOX%3E2102_00429%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true