High Fidelity Speech Regeneration with Application to Speech Enhancement
Saved in:
Main authors: | Polyak, Adam; Wolf, Lior; Adi, Yossi; Kabeli, Ori; Taigman, Yaniv |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning; Computer Science - Sound |
Online access: | Order full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Polyak, Adam; Wolf, Lior; Adi, Yossi; Kabeli, Ori; Taigman, Yaniv |
description | Speech enhancement has seen great improvement in recent years, mainly through
contributions in denoising, speaker separation, and dereverberation methods
that mostly deal with environmental effects on vocal audio. To enhance speech
beyond the limitations of the original signal, we take a regeneration approach,
in which we recreate the speech from its essence, including the semi-recognized
speech, prosody features, and identity. We propose a wav-to-wav generative
model for speech that can generate 24 kHz speech in real time and that
utilizes a compact speech representation, composed of ASR and identity
features, to achieve a higher level of intelligibility. Inspired by voice
conversion methods, we train to augment the speech characteristics while
preserving the identity of the source using an auxiliary identity network.
Perceptual acoustic metrics and subjective tests show that the method obtains
valuable improvements over recent baselines. |
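The abstract describes a compact speech representation that combines per-frame ASR (content) features and prosody with an utterance-level identity embedding to condition the generator. The sketch below illustrates only that conditioning scheme; all dimensions, feature choices, and the simple concatenation are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
N_FRAMES = 100    # number of analysis frames in the utterance
D_ASR = 256       # per-frame ASR (content) feature size
D_PROSODY = 2     # per-frame prosody features, e.g. F0 and loudness
D_IDENTITY = 64   # utterance-level speaker embedding size

def build_conditioning(asr_feats, prosody, identity):
    """Concatenate per-frame content and prosody features with the
    utterance-level identity embedding broadcast to every frame,
    yielding one conditioning matrix for a generator network."""
    identity_tiled = np.tile(identity, (asr_feats.shape[0], 1))
    return np.concatenate([asr_feats, prosody, identity_tiled], axis=1)

# Stand-in features; a real system would extract these from audio.
asr_feats = np.random.randn(N_FRAMES, D_ASR)
prosody = np.random.randn(N_FRAMES, D_PROSODY)
identity = np.random.randn(D_IDENTITY)

cond = build_conditioning(asr_feats, prosody, identity)
print(cond.shape)  # (100, 322)
```

Broadcasting a single identity vector across all frames is what lets the generator hold the speaker constant while the content and prosody features vary frame by frame.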
doi_str_mv | 10.48550/arxiv.2102.00429 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2102.00429 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2102_00429 |
source | arXiv.org |
subjects | Computer Science - Learning; Computer Science - Sound |
title | High Fidelity Speech Regeneration with Application to Speech Enhancement |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-11T13%3A53%3A52IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-arxiv_GOX&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=High%20Fidelity%20Speech%20Regeneration%20with%20Application%20to%20Speech%20Enhancement&rft.au=Polyak,%20Adam&rft.date=2021-01-31&rft_id=info:doi/10.48550/arxiv.2102.00429&rft_dat=%3Carxiv_GOX%3E2102_00429%3C/arxiv_GOX%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_id=info:pmid/&rfr_iscdi=true |