Audio Super-Resolution With Robust Speech Representation Learning of Masked Autoencoder

This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial in learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with the initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust for input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.

Detailed Description

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024-01, Vol. 32, p. 1-11
Main Authors: Kim, Seung-Bin, Lee, Sang-Hoon, Choi, Ha-Yeong, Lee, Seong-Whan
Format: Article
Language: English
Subjects:
Online Access: Full text
container_end_page 11
container_issue
container_start_page 1
container_title IEEE/ACM transactions on audio, speech, and language processing
container_volume 32
creator Kim, Seung-Bin
Lee, Sang-Hoon
Choi, Ha-Yeong
Lee, Seong-Whan
description This paper proposes Fre-Painter, a high-fidelity audio super-resolution system that utilizes robust speech representation learning with various masking strategies. Recently, masked autoencoders have been found to be beneficial in learning robust representations of audio for speech classification tasks. Following these studies, we leverage these representations and investigate several masking strategies for neural audio super-resolution. In this paper, we propose an upper-band masking strategy with the initialization of the mask token, which is simple but efficient for audio super-resolution. Furthermore, we propose a mix-ratio masking strategy that makes the model robust for input speech with various sampling rates. For practical applicability, we extend Fre-Painter to a text-to-speech system, which synthesizes high-resolution speech using low-resolution speech data. The experimental results demonstrate that Fre-Painter outperforms other neural audio super-resolution models.
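The masking strategies named in the abstract can be illustrated with a short, assumption-laden sketch; this is a toy illustration, not the authors' implementation. In upper-band masking, the spectrogram bins above a cutoff (determined by the input sampling rate) are replaced with a mask token that the model learns to fill in; the mix-ratio idea is approximated here by randomizing that cutoff per training example. All function names, shapes, and the scalar mask token are illustrative assumptions (the paper describes an initialized mask token rather than a fixed scalar).

```python
import numpy as np

def upper_band_mask(spec: np.ndarray, cutoff_bin: int, mask_token: float) -> np.ndarray:
    """Replace every frequency bin at or above `cutoff_bin` with a mask
    token, imitating the empty upper band of band-limited (low-resolution)
    speech. `spec` is a (n_freq_bins, n_frames) spectrogram."""
    masked = spec.copy()
    masked[cutoff_bin:, :] = mask_token  # fill the upper band with the token
    return masked

def mix_ratio_mask(spec: np.ndarray, rng: np.random.Generator,
                   min_bin: int, mask_token: float) -> np.ndarray:
    """Draw a random cutoff per example so training covers many effective
    input sampling rates (a rough stand-in for the mix-ratio strategy)."""
    cutoff = int(rng.integers(min_bin, spec.shape[0]))
    return upper_band_mask(spec, cutoff, mask_token)

# Illustrative usage: 80 frequency bins, input band-limited to the lower 40.
rng = np.random.default_rng(0)
spec = rng.random((80, 120))
masked = upper_band_mask(spec, cutoff_bin=40, mask_token=0.0)
```

During super-resolution training, the model would then be asked to reconstruct the original `spec` from `masked`, so the loss is dominated by the masked upper band.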
doi_str_mv 10.1109/TASLP.2023.3349053
format Article
fulltext fulltext
identifier ISSN: 2329-9290
ispartof IEEE/ACM transactions on audio, speech, and language processing, 2024-01, Vol.32, p.1-11
issn 2329-9290
2329-9304
language eng
recordid cdi_crossref_primary_10_1109_TASLP_2023_3349053
source ACM Digital Library; IEEE Electronic Library (IEL)
subjects Audio data
Audio super-resolution
audio synthesis
bandwidth extension
Computational modeling
Decoding
Machine learning
masked autoencoder
Masking
Representations
Robustness
Self-supervised learning
Speech
Speech processing
Superresolution
Task analysis
Training
title Audio Super-Resolution With Robust Speech Representation Learning of Masked Autoencoder