SPADE: Self-supervised Pretraining for Acoustic DisEntanglement
Saved in:

Main authors: | Harvill, John; Barber, Jarred; Nair, Arun; Pishehvar, Ramin |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning; Computer Science - Sound |
Online access: | Request full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Harvill, John; Barber, Jarred; Nair, Arun; Pishehvar, Ramin |
description | Self-supervised representation learning approaches have grown in popularity
due to the ability to train models on large amounts of unlabeled data and have
demonstrated success in diverse fields such as natural language processing,
computer vision, and speech. Previous self-supervised work in the speech domain
has disentangled multiple attributes of speech such as linguistic content,
speaker identity, and rhythm. In this work, we introduce a self-supervised
approach to disentangle room acoustics from speech and use the acoustic
representation on the downstream task of device arbitration. Our results
demonstrate that our proposed approach significantly improves performance over
a baseline when labeled training data is scarce, indicating that our
pretraining scheme learns to encode room acoustic information while remaining
invariant to other attributes of the speech signal. |
doi_str_mv | 10.48550/arxiv.2302.01483 |
format | Article |
creationdate | 2023-02-02 |
rights | http://creativecommons.org/licenses/by/4.0 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2302.01483 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2302_01483 |
source | arXiv.org |
subjects | Computer Science - Learning; Computer Science - Sound |
title | SPADE: Self-supervised Pretraining for Acoustic DisEntanglement |
url | https://arxiv.org/abs/2302.01483 |
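The abstract describes disentangling room acoustics from speech via self-supervised pretraining, but this record carries no architectural or objective details. As a loose, purely hypothetical sketch of the general idea (none of the names, constants, or modeling choices below come from the paper), a contrastive setup can pair two different utterances rendered through the same simulated room and score an encoder by whether same-room embeddings agree while different-room embeddings disagree:

```python
import numpy as np

# Toy illustration only: every function and constant here is hypothetical,
# showing the generic recipe of contrastive pretraining for room-acoustic
# embeddings, not the SPADE method itself.

rng = np.random.default_rng(0)
N_FREQ = 64  # number of frequency bins in the toy spectra


def room_filter(rng):
    """Stand-in for a room's acoustic signature: a smooth random magnitude response."""
    w = rng.normal(size=N_FREQ)
    kernel = np.hanning(9)
    kernel /= kernel.sum()
    return np.exp(np.convolve(w, kernel, mode="same"))


def utterance(rng, n_frames=50):
    """Stand-in for clean-speech magnitude spectra (random frames)."""
    return np.abs(rng.normal(size=(n_frames, N_FREQ))) + 1e-3


def embed(spec):
    """Toy 'acoustic encoder': time-averaged, unit-normalized log-spectrum.

    Averaging over frames washes out utterance content, while the room's
    spectral shaping survives because it multiplies every frame identically.
    """
    e = np.log(spec).mean(axis=0)
    e -= e.mean()
    return e / np.linalg.norm(e)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchor i should match positive i against the rest of the batch."""
    sims = anchors @ positives.T / temperature           # (B, B) cosine similarities
    logits = sims - sims.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


# Build a batch: each simulated "room" renders two *different* utterances.
rooms = [room_filter(rng) for _ in range(8)]
anchors = np.stack([embed(utterance(rng) * r) for r in rooms])
positives = np.stack([embed(utterance(rng) * r) for r in rooms])

matched = info_nce(anchors, positives)                          # correct room pairing
mismatched = info_nce(anchors, np.roll(positives, 1, axis=0))   # rooms shuffled
print(f"matched-room loss:    {matched:.3f}")
print(f"mismatched-room loss: {mismatched:.3f}")
```

Here the "encoder" is a fixed time-average, so invariance to speech content is built in rather than learned; a real system would backpropagate the contrastive loss into a neural encoder so that it discovers which cues are room-specific.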