Augmentation Adversarial Training for Self-Supervised Speaker Representation Learning

The goal of this work is to train robust speaker recognition models using self-supervised representation learning. Recent works on self-supervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddin...

Detailed description

Bibliographic details
Published in: IEEE journal of selected topics in signal processing 2022-10, Vol.16 (6), p.1253-1262
Main authors: Kang, Jingu; Huh, Jaesung; Heo, Hee Soo; Chung, Joon Son
Format: Article
Language: English
container_end_page 1262
container_issue 6
container_start_page 1253
container_title IEEE journal of selected topics in signal processing
container_volume 16
creator Kang, Jingu
Huh, Jaesung
Heo, Hee Soo
Chung, Joon Son
description The goal of this work is to train robust speaker recognition models using self-supervised representation learning. Recent works on self-supervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans. We also conduct semi-supervised learning experiments to show that augmentation adversarial training benefits performance in presence of speaker labels.
doi_str_mv 10.1109/JSTSP.2022.3200915
format Article
fullrecord <record><control><sourceid>proquest_RIE</sourceid><recordid>TN_cdi_proquest_journals_2726108495</recordid><sourceformat>XML</sourceformat><sourcesystem>PC</sourcesystem><ieee_id>9864218</ieee_id><sourcerecordid>2726108495</sourcerecordid><originalsourceid>FETCH-LOGICAL-c295t-ad1c1c9ee10e66696a670a78913deb7b577e1bd81ca7f91db3abbcfa968680293</originalsourceid><addsrcrecordid>eNo9kE1Lw0AURQdRsFb_gG4CrlPnzUzmY1mKVqWgmHY9TJKXktomcSYp-O9tbOnqvsU978Ih5B7oBICap_d0mX5OGGVswhmlBpILMgIjIKZCi8vh5iwWScKvyU0IG0oTJUGMyGrar3dYd66rmjqaFnv0wfnKbaOld1Vd1euobHyU4raM075Fv68CFlHaovtGH31h6zGc-QU6PzC35Kp024B3pxyT1cvzcvYaLz7mb7PpIs6ZSbrYFZBDbhCBopTSSCcVdUob4AVmKkuUQsgKDblTpYEi4y7L8tIZqaWmzPAxeTz-bX3z02Po7KbpfX2YtEwxCVQLkxxa7NjKfROCx9K2vto5_2uB2kGf_ddnB332pO8APRyhChHPgNFSMND8D7KYbVs</addsrcrecordid><sourcetype>Aggregation Database</sourcetype><iscdi>true</iscdi><recordtype>article</recordtype><pqid>2726108495</pqid></control><display><type>article</type><title>Augmentation Adversarial Training for Self-Supervised Speaker Representation Learning</title><source>IEEE Electronic Library (IEL)</source><creator>Kang, Jingu ; Huh, Jaesung ; Heo, Hee Soo ; Chung, Joon Son</creator><creatorcontrib>Kang, Jingu ; Huh, Jaesung ; Heo, Hee Soo ; Chung, Joon Son</creatorcontrib><description>The goal of this work is to train robust speaker recognition models using self-supervised representation learning. Recent works on self-supervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. 
Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans. We also conduct semi-supervised learning experiments to show that augmentation adversarial training benefits performance in presence of speaker labels.</description><identifier>ISSN: 1932-4553</identifier><identifier>EISSN: 1941-0484</identifier><identifier>DOI: 10.1109/JSTSP.2022.3200915</identifier><identifier>CODEN: IJSTGY</identifier><language>eng</language><publisher>New York: IEEE</publisher><subject>Augmentation ; Invariants ; Representation learning ; Representations ; Self-supervised learning ; Semi-supervised learning ; Semisupervised learning ; Speaker recognition ; Speech recognition ; Training</subject><ispartof>IEEE journal of selected topics in signal processing, 2022-10, Vol.16 (6), p.1253-1262</ispartof><rights>Copyright The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE) 2022</rights><lds50>peer_reviewed</lds50><woscitedreferencessubscribed>false</woscitedreferencessubscribed><citedby>FETCH-LOGICAL-c295t-ad1c1c9ee10e66696a670a78913deb7b577e1bd81ca7f91db3abbcfa968680293</citedby><cites>FETCH-LOGICAL-c295t-ad1c1c9ee10e66696a670a78913deb7b577e1bd81ca7f91db3abbcfa968680293</cites><orcidid>0000-0001-7741-7275 ; 0000-0002-9284-5945</orcidid></display><links><openurl>$$Topenurl_article</openurl><openurlfulltext>$$Topenurlfull_article</openurlfulltext><thumbnail>$$Tsyndetics_thumb_exl</thumbnail><linktohtml>$$Uhttps://ieeexplore.ieee.org/document/9864218$$EHTML$$P50$$Gieee$$H</linktohtml><link.rule.ids>314,777,781,793,27906,27907,54740</link.rule.ids><linktorsrc>$$Uhttps://ieeexplore.ieee.org/document/9864218$$EView_record_in_IEEE$$FView_record_in_$$GIEEE</linktorsrc></links><search><creatorcontrib>Kang, Jingu</creatorcontrib><creatorcontrib>Huh, Jaesung</creatorcontrib><creatorcontrib>Heo, Hee Soo</creatorcontrib><creatorcontrib>Chung, Joon Son</creatorcontrib><title>Augmentation Adversarial Training for Self-Supervised Speaker Representation Learning</title><title>IEEE journal of selected topics in signal processing</title><addtitle>JSTSP</addtitle><description>The goal of this work is to train robust speaker recognition models using self-supervised representation learning. Recent works on self-supervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. 
Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans. We also conduct semi-supervised learning experiments to show that augmentation adversarial training benefits performance in presence of speaker labels.</description><subject>Augmentation</subject><subject>Invariants</subject><subject>Representation learning</subject><subject>Representations</subject><subject>Self-supervised learning</subject><subject>Semi-supervised learning</subject><subject>Semisupervised learning</subject><subject>Speaker recognition</subject><subject>Speech recognition</subject><subject>Training</subject><issn>1932-4553</issn><issn>1941-0484</issn><fulltext>true</fulltext><rsrctype>article</rsrctype><creationdate>2022</creationdate><recordtype>article</recordtype><sourceid>RIE</sourceid><recordid>eNo9kE1Lw0AURQdRsFb_gG4CrlPnzUzmY1mKVqWgmHY9TJKXktomcSYp-O9tbOnqvsU978Ih5B7oBICap_d0mX5OGGVswhmlBpILMgIjIKZCi8vh5iwWScKvyU0IG0oTJUGMyGrar3dYd66rmjqaFnv0wfnKbaOld1Vd1euobHyU4raM075Fv68CFlHaovtGH31h6zGc-QU6PzC35Kp024B3pxyT1cvzcvYaLz7mb7PpIs6ZSbrYFZBDbhCBopTSSCcVdUob4AVmKkuUQsgKDblTpYEi4y7L8tIZqaWmzPAxeTz-bX3z02Po7KbpfX2YtEwxCVQLkxxa7NjKfROCx9K2vto5_2uB2kGf_ddnB332pO8APRyhChHPgNFSMND8D7KYbVs</recordid><startdate>20221001</startdate><enddate>20221001</enddate><creator>Kang, Jingu</creator><creator>Huh, Jaesung</creator><creator>Heo, Hee Soo</creator><creator>Chung, Joon Son</creator><general>IEEE</general><general>The Institute of Electrical and Electronics Engineers, Inc. 
(IEEE)</general><scope>97E</scope><scope>RIA</scope><scope>RIE</scope><scope>AAYXX</scope><scope>CITATION</scope><scope>7SP</scope><scope>8FD</scope><scope>H8D</scope><scope>L7M</scope><orcidid>https://orcid.org/0000-0001-7741-7275</orcidid><orcidid>https://orcid.org/0000-0002-9284-5945</orcidid></search><sort><creationdate>20221001</creationdate><title>Augmentation Adversarial Training for Self-Supervised Speaker Representation Learning</title><author>Kang, Jingu ; Huh, Jaesung ; Heo, Hee Soo ; Chung, Joon Son</author></sort><facets><frbrtype>5</frbrtype><frbrgroupid>cdi_FETCH-LOGICAL-c295t-ad1c1c9ee10e66696a670a78913deb7b577e1bd81ca7f91db3abbcfa968680293</frbrgroupid><rsrctype>articles</rsrctype><prefilter>articles</prefilter><language>eng</language><creationdate>2022</creationdate><topic>Augmentation</topic><topic>Invariants</topic><topic>Representation learning</topic><topic>Representations</topic><topic>Self-supervised learning</topic><topic>Semi-supervised learning</topic><topic>Semisupervised learning</topic><topic>Speaker recognition</topic><topic>Speech recognition</topic><topic>Training</topic><toplevel>peer_reviewed</toplevel><toplevel>online_resources</toplevel><creatorcontrib>Kang, Jingu</creatorcontrib><creatorcontrib>Huh, Jaesung</creatorcontrib><creatorcontrib>Heo, Hee Soo</creatorcontrib><creatorcontrib>Chung, Joon Son</creatorcontrib><collection>IEEE All-Society Periodicals Package (ASPP) 2005-present</collection><collection>IEEE All-Society Periodicals Package (ASPP) 1998-Present</collection><collection>IEEE Electronic Library (IEL)</collection><collection>CrossRef</collection><collection>Electronics &amp; Communications Abstracts</collection><collection>Technology Research Database</collection><collection>Aerospace Database</collection><collection>Advanced Technologies Database with Aerospace</collection><jtitle>IEEE journal of selected topics in signal processing</jtitle></facets><delivery><delcategory>Remote Search 
Resource</delcategory><fulltext>fulltext_linktorsrc</fulltext></delivery><addata><au>Kang, Jingu</au><au>Huh, Jaesung</au><au>Heo, Hee Soo</au><au>Chung, Joon Son</au><format>journal</format><genre>article</genre><ristype>JOUR</ristype><atitle>Augmentation Adversarial Training for Self-Supervised Speaker Representation Learning</atitle><jtitle>IEEE journal of selected topics in signal processing</jtitle><stitle>JSTSP</stitle><date>2022-10-01</date><risdate>2022</risdate><volume>16</volume><issue>6</issue><spage>1253</spage><epage>1262</epage><pages>1253-1262</pages><issn>1932-4553</issn><eissn>1941-0484</eissn><coden>IJSTGY</coden><abstract>The goal of this work is to train robust speaker recognition models using self-supervised representation learning. Recent works on self-supervised speaker representations are based on contrastive learning in which they encourage within-utterance embeddings to be similar and across-utterance embeddings to be dissimilar. However, since the within-utterance segments share the same acoustic characteristics, it is difficult to separate the speaker information from the channel information. To this end, we propose an augmentation adversarial training strategy that trains the network to be discriminative for the speaker information, while invariant to the augmentation applied. Since the augmentation simulates the acoustic characteristics, training the network to be invariant to augmentation also encourages the network to be invariant to the channel information in general. Extensive experiments on the VoxCeleb and VOiCES datasets show significant improvements over previous works using self-supervision, and the performance of our self-supervised models far exceeds that of humans. 
We also conduct semi-supervised learning experiments to show that augmentation adversarial training benefits performance in presence of speaker labels.</abstract><cop>New York</cop><pub>IEEE</pub><doi>10.1109/JSTSP.2022.3200915</doi><tpages>10</tpages><orcidid>https://orcid.org/0000-0001-7741-7275</orcidid><orcidid>https://orcid.org/0000-0002-9284-5945</orcidid></addata></record>
fulltext fulltext_linktorsrc
identifier ISSN: 1932-4553
ispartof IEEE journal of selected topics in signal processing, 2022-10, Vol.16 (6), p.1253-1262
issn 1932-4553
1941-0484
language eng
recordid cdi_proquest_journals_2726108495
source IEEE Electronic Library (IEL)
subjects Augmentation
Invariants
Representation learning
Representations
Self-supervised learning
Semi-supervised learning
Semisupervised learning
Speaker recognition
Speech recognition
Training
title Augmentation Adversarial Training for Self-Supervised Speaker Representation Learning
url https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-17T09%3A41%3A37IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Augmentation%20Adversarial%20Training%20for%20Self-Supervised%20Speaker%20Representation%20Learning&rft.jtitle=IEEE%20journal%20of%20selected%20topics%20in%20signal%20processing&rft.au=Kang,%20Jingu&rft.date=2022-10-01&rft.volume=16&rft.issue=6&rft.spage=1253&rft.epage=1262&rft.pages=1253-1262&rft.issn=1932-4553&rft.eissn=1941-0484&rft.coden=IJSTGY&rft_id=info:doi/10.1109/JSTSP.2022.3200915&rft_dat=%3Cproquest_RIE%3E2726108495%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2726108495&rft_id=info:pmid/&rft_ieee_id=9864218&rfr_iscdi=true