High Fidelity Speech Regeneration with Application to Speech Enhancement

Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time and which utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method obtains valuable improvements over recent baselines.

Full Description

Saved in:
Bibliographic Details
Main Authors: Polyak, Adam; Wolf, Lior; Adi, Yossi; Kabeli, Ori; Taigman, Yaniv
Format: Article
Language: English
Subjects:
Online Access: Order full text
creator Polyak, Adam
Wolf, Lior
Adi, Yossi
Kabeli, Ori
Taigman, Yaniv
description Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time and which utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method obtains valuable improvements over recent baselines.
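The abstract describes a pipeline in which the input waveform is reduced to a compact representation (ASR-derived content features plus a speaker-identity embedding from an auxiliary network), and a generator maps that representation back to a 24 kHz waveform. The sketch below illustrates only the data flow of such a design; every shape, function name, and the random linear "networks" are illustrative stand-ins, not the authors' actual model.

```python
import numpy as np

SAMPLE_RATE = 24_000  # target rate stated in the abstract
HOP = 300             # assumed hop size: 12.5 ms frames at 24 kHz

def content_features(wav: np.ndarray, dim: int = 256) -> np.ndarray:
    """Stand-in for an ASR encoder: one content vector per hop of audio."""
    n_frames = len(wav) // HOP
    rng = np.random.default_rng(0)
    proj = rng.standard_normal((HOP, dim)) / np.sqrt(HOP)
    frames = wav[: n_frames * HOP].reshape(n_frames, HOP)
    return frames @ proj                      # (n_frames, dim)

def identity_embedding(wav: np.ndarray, dim: int = 64) -> np.ndarray:
    """Stand-in for the auxiliary identity network: one vector per utterance."""
    rng = np.random.default_rng(1)
    proj = rng.standard_normal((2, dim))
    stats = np.array([wav.mean(), wav.std()])  # crude utterance summary
    return np.tanh(stats @ proj)               # (dim,)

def regenerate(wav: np.ndarray) -> np.ndarray:
    """Wav-to-wav: combine content and identity, then decode to samples."""
    c = content_features(wav)                              # (T, 256)
    e = identity_embedding(wav)                            # (64,)
    z = np.concatenate([c, np.tile(e, (len(c), 1))], 1)    # (T, 320)
    rng = np.random.default_rng(2)
    dec = rng.standard_normal((z.shape[1], HOP)) / np.sqrt(z.shape[1])
    return (z @ dec).reshape(-1)               # regenerated 24 kHz samples
```

The key design point the abstract emphasizes is that the generator never sees the noisy waveform directly, only the compact content-plus-identity representation, which is why the output is "regenerated" rather than filtered.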
doi_str_mv 10.48550/arxiv.2102.00429
format Article
fulltext fulltext_linktorsrc
identifier DOI: 10.48550/arxiv.2102.00429
ispartof
issn
language eng
recordid cdi_arxiv_primary_2102_00429
source arXiv.org
subjects Computer Science - Learning
Computer Science - Sound
title High Fidelity Speech Regeneration with Application to Speech Enhancement
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T13%3A53%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=High%20Fidelity%20Speech%20Regeneration%20with%20Application%20to%20Speech%20Enhancement&rft.au=Polyak,%20Adam&rft.date=2021-01-31&rft_id=info:doi/10.48550/arxiv.2102.00429&rft_dat=%3Carxiv_GOX%3E2102_00429%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true