Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications

Visual events are usually accompanied by sounds in our daily lives. However, can machines learn to correlate the visual scene with sound, and localize the sound source, merely by observing them as humans do? To investigate this empirical learnability, we first present a novel unsupervised algorithm that addresses the problem of localizing sound sources in visual scenes. To achieve this goal, a two-stream network structure that handles each modality with an attention mechanism is developed for sound source localization. The network naturally reveals the localized response in the scene without human annotation. In addition, a new sound source dataset is developed for performance evaluation. Nevertheless, our empirical evaluation shows that the unsupervised method draws false conclusions in some cases. We show that these false conclusions cannot be fixed without human prior knowledge, owing to the well-known mismatch between correlation and causality. To address this issue, we extend our network to supervised and semi-supervised settings via a simple modification, which is possible due to the general architecture of our two-stream network. We show that the false conclusions can be effectively corrected even with a small amount of supervision, i.e., in a semi-supervised setup. Furthermore, we demonstrate the versatility of the learned audio and visual embeddings for cross-modal content alignment, and we extend the proposed algorithm to a new application: sound-saliency-based automatic camera view panning in 360-degree videos.
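The abstract describes a two-stream design in which each modality is encoded separately and an attention mechanism produces the localization response. The sketch below is only a rough illustration of how such an audio-visual attention interaction could be wired; the class name TwoStreamLocalizer, the toy encoders, and all dimensions are assumptions made for illustration, not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): a two-stream audio-visual network
# that produces a sound-source localization map via audio-visual attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamLocalizer(nn.Module):
    """Hypothetical two-stream model: a visual encoder yields a spatial feature
    map, an audio encoder yields a global sound embedding, and their cosine
    similarity over spatial locations forms the localization (attention) map."""
    def __init__(self, dim=512):
        super().__init__()
        # Placeholder backbones; real models would use CNN encoders per modality.
        self.visual = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # image -> H'xW' grid
        self.audio = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, image, audio_feat):
        v = F.normalize(self.visual(image), dim=1)          # (B, dim, H', W')
        a = F.normalize(self.audio(audio_feat), dim=1)       # (B, dim)
        # Similarity of the audio embedding to every spatial location.
        attn = torch.einsum('bchw,bc->bhw', v, a)             # (B, H', W')
        heatmap = torch.sigmoid(attn)                         # localization response in [0, 1]
        # Attention-pooled visual vector, usable for (un)supervised contrastive training.
        w = torch.softmax(attn.flatten(1), dim=1).view_as(attn)
        z_v = torch.einsum('bchw,bhw->bc', v, w)              # (B, dim)
        return heatmap, z_v, a

# Usage: random tensors stand in for an image and a pooled audio feature.
model = TwoStreamLocalizer()
heatmap, z_v, z_a = model(torch.randn(2, 3, 224, 224), torch.randn(2, 128))
print(heatmap.shape)  # torch.Size([2, 14, 14])
```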

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021-05, Vol. 43 (5), p. 1605-1619
Main Authors: Senocak, Arda; Oh, Tae-Hyun; Kim, Junsik; Yang, Ming-Hsuan; Kweon, In So
Format: Article
Language: English
Subjects: Acoustics; Algorithms; Annotations; Audio-visual learning; Correlation; Cross-modal retrieval; Deep learning; Empirical analysis; Multi-modal learning; Network architecture; Performance evaluation; Self-supervision; Sound; Sound localization; Sound sources; Task analysis; Unsupervised learning; Videos; Visualization
Online Access: Order full text
DOI: 10.1109/TPAMI.2019.2952095
ISSN: 0162-8828
EISSN: 1939-3539, 2160-9292
Source: IEEE Electronic Library (IEL)