Audio Super-Resolution With Robust Speech Representation Learning of Masked Autoencoder
This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial in learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with the initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust for input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.
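The two masking ideas named in the abstract can be sketched in a few lines of code. The following PyTorch snippet is a minimal, hypothetical illustration rather than the authors' released implementation: the class name UpperBandMasker, the mel-spectrogram input, the scalar mask token with zero initialization, and the bin ranges are all assumptions made for illustration only.

```python
# Hypothetical sketch of the two masking strategies described in the abstract.
# Upper-band masking replaces everything above a fixed cutoff frequency with a
# learned mask token (the missing upper band of low-resolution speech), while
# mix-ratio masking samples that cutoff per example so training covers inputs
# with various effective sampling rates. All names and shapes are illustrative.

import torch
import torch.nn as nn


class UpperBandMasker(nn.Module):
    def __init__(self, n_mels: int = 128):
        super().__init__()
        self.n_mels = n_mels
        # Learned mask token, broadcast over the masked time-frequency bins.
        # Zero init is a placeholder; the paper treats mask-token
        # initialization as part of its method, which is not reproduced here.
        self.mask_token = nn.Parameter(torch.zeros(1))

    def upper_band(self, mel: torch.Tensor, cutoff_bin: int) -> torch.Tensor:
        """mel: (batch, n_mels, frames). Mask all bins >= cutoff_bin."""
        masked = mel.clone()
        masked[:, cutoff_bin:, :] = self.mask_token
        return masked

    def mix_ratio(self, mel: torch.Tensor,
                  min_bin: int = 32, max_bin: int = 96) -> torch.Tensor:
        """Sample a random cutoff for each example in the batch."""
        masked = mel.clone()
        for i in range(mel.size(0)):
            cutoff = int(torch.randint(min_bin, max_bin + 1, (1,)))
            masked[i, cutoff:, :] = self.mask_token
        return masked


if __name__ == "__main__":
    masker = UpperBandMasker(n_mels=128)
    mel = torch.randn(4, 128, 200)                      # dummy mel-spectrogram batch
    print(masker.upper_band(mel, cutoff_bin=64).shape)  # torch.Size([4, 128, 200])
    print(masker.mix_ratio(mel).shape)                  # torch.Size([4, 128, 200])
```

Sampling the cutoff per example is what, per the abstract, makes the model robust to input speech with various sampling rates; in the real system the cutoff would be tied to the input's effective bandwidth rather than drawn from the hypothetical fixed range used above.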
Saved in:
Published in: | IEEE/ACM transactions on audio, speech, and language processing, 2024-01, Vol.32, p.1-11 |
---|---|
Main authors: | Kim, Seung-Bin; Lee, Sang-Hoon; Choi, Ha-Yeong; Lee, Seong-Whan |
Format: | Article |
Language: | English |
Subjects: | Audio super-resolution; audio synthesis; bandwidth extension; masked autoencoder; self-supervised learning; speech processing |
Online access: | Full text |
container_end_page | 11 |
---|---|
container_issue | |
container_start_page | 1 |
container_title | IEEE/ACM transactions on audio, speech, and language processing |
container_volume | 32 |
creator | Kim, Seung-Bin; Lee, Sang-Hoon; Choi, Ha-Yeong; Lee, Seong-Whan |
description | This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial in learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with the initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust for input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models. |
doi_str_mv | 10.1109/TASLP.2023.3349053 |
format | Article |
fulltext | fulltext |
identifier | ISSN: 2329-9290 |
ispartof | IEEE/ACM transactions on audio, speech, and language processing, 2024-01, Vol.32, p.1-11 |
issn | 2329-9290; 2329-9304 (EISSN) |
language | eng |
recordid | cdi_crossref_primary_10_1109_TASLP_2023_3349053 |
source | ACM Digital Library; IEEE Electronic Library (IEL) |
subjects | Audio data; Audio super-resolution; audio synthesis; bandwidth extension; Computational modeling; Decoding; Machine learning; masked autoencoder; Masking; Representations; Robustness; Self-supervised learning; Speech; Speech processing; Superresolution; Task analysis; Training |
title | Audio Super-Resolution With Robust Speech Representation Learning of Masked Autoencoder |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-15T03%3A31%3A02IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_cross&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Audio%20Super-Resolution%20With%20Robust%20Speech%20Representation%20Learning%20of%20Masked%20Autoencoder&rft.jtitle=IEEE/ACM%20transactions%20on%20audio,%20speech,%20and%20language%20processing&rft.au=Kim,%20Seung-Bin&rft.date=2024-01-01&rft.volume=32&rft.spage=1&rft.epage=11&rft.pages=1-11&rft.issn=2329-9290&rft.eissn=2329-9304&rft.coden=ITASFA&rft_id=info:doi/10.1109/TASLP.2023.3349053&rft_dat=%3Cproquest_cross%3E2912944681%3C/proquest_cross%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2912944681&rft_id=info:pmid/&rft_ieee_id=10381805&rfr_iscdi=true |