SPADE: Self-supervised Pretraining for Acoustic DisEntanglement
Saved in:

Main authors: | Harvill, John; Barber, Jarred; Nair, Arun; Pishehvar, Ramin |
---|---|
Format: | Article |
Language: | eng |
Subjects: | Computer Science - Learning; Computer Science - Sound |
Online access: | Request full text |
container_end_page | |
---|---|
container_issue | |
container_start_page | |
container_title | |
container_volume | |
creator | Harvill, John; Barber, Jarred; Nair, Arun; Pishehvar, Ramin |
description | Self-supervised representation learning approaches have grown in popularity
due to the ability to train models on large amounts of unlabeled data and have
demonstrated success in diverse fields such as natural language processing,
computer vision, and speech. Previous self-supervised work in the speech domain
has disentangled multiple attributes of speech such as linguistic content,
speaker identity, and rhythm. In this work, we introduce a self-supervised
approach to disentangle room acoustics from speech and use the acoustic
representation on the downstream task of device arbitration. Our results
demonstrate that our proposed approach significantly improves performance over
a baseline when labeled training data is scarce, indicating that our
pretraining scheme learns to encode room acoustic information while remaining
invariant to other attributes of the speech signal. |
doi_str_mv | 10.48550/arxiv.2302.01483 |
format | Article |
creationdate | 2023-02-02 |
rights | http://creativecommons.org/licenses/by/4.0 |
fulltext | fulltext_linktorsrc |
identifier | DOI: 10.48550/arxiv.2302.01483 |
ispartof | |
issn | |
language | eng |
recordid | cdi_arxiv_primary_2302_01483 |
source | arXiv.org |
subjects | Computer Science - Learning; Computer Science - Sound |
title | SPADE: Self-supervised Pretraining for Acoustic DisEntanglement |
url | https://arxiv.org/abs/2302.01483 |
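The abstract describes disentangling room acoustics from speech via self-supervised pretraining, but this record carries no architectural or objective details. As a loose, purely hypothetical sketch of the general idea (none of the names, constants, or modeling choices below come from the paper), a contrastive setup can pair two different utterances rendered through the same simulated room and score an encoder by whether same-room embeddings agree while different-room embeddings disagree:

```python
import numpy as np

# Toy illustration only: every function and constant here is hypothetical,
# showing the generic recipe of contrastive pretraining for room-acoustic
# embeddings, not the SPADE method itself.

rng = np.random.default_rng(0)
N_FREQ = 64  # number of frequency bins in the toy spectra


def room_filter(rng):
    """Stand-in for a room's acoustic signature: a smooth random magnitude response."""
    w = rng.normal(size=N_FREQ)
    kernel = np.hanning(9)
    kernel /= kernel.sum()
    return np.exp(np.convolve(w, kernel, mode="same"))


def utterance(rng, n_frames=50):
    """Stand-in for clean-speech magnitude spectra (random frames)."""
    return np.abs(rng.normal(size=(n_frames, N_FREQ))) + 1e-3


def embed(spec):
    """Toy 'acoustic encoder': time-averaged, unit-normalized log-spectrum.

    Averaging over frames washes out utterance content, while the room's
    spectral shaping survives because it multiplies every frame identically.
    """
    e = np.log(spec).mean(axis=0)
    e -= e.mean()
    return e / np.linalg.norm(e)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchor i should match positive i against the rest of the batch."""
    sims = anchors @ positives.T / temperature           # (B, B) cosine similarities
    logits = sims - sims.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))


# Build a batch: each simulated "room" renders two *different* utterances.
rooms = [room_filter(rng) for _ in range(8)]
anchors = np.stack([embed(utterance(rng) * r) for r in rooms])
positives = np.stack([embed(utterance(rng) * r) for r in rooms])

matched = info_nce(anchors, positives)                          # correct room pairing
mismatched = info_nce(anchors, np.roll(positives, 1, axis=0))   # rooms shuffled
print(f"matched-room loss:    {matched:.3f}")
print(f"mismatched-room loss: {mismatched:.3f}")
```

Here the "encoder" is a fixed time-average, so invariance to speech content is built in rather than learned; a real system would backpropagate the contrastive loss into a neural encoder so that it discovers which cues are room-specific.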