Unsupervised Synthetic Acoustic Image Generation for Audio-Visual Scene Understanding
Acoustic images are an emergent data modality for multimodal scene understanding. Such images have the peculiarity of distinguishing the spectral signature of sound coming from different directions in space, thus providing richer information compared to that derived from single or binaural...
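To make the data modality concrete, here is a minimal, illustrative sketch of how a planar microphone array can form an acoustic image via delay-and-sum beamforming. The array geometry, grid resolution, field of view, and function name are assumptions for illustration only; the actual acquisition pipeline behind the paper is not specified in this record.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # approximate speed of sound in air, m/s

def delay_and_sum_acoustic_image(signals, mic_positions, fs,
                                 image_shape=(36, 48), fov_deg=(60.0, 80.0)):
    """Form an acoustic image from raw microphone-array recordings.

    Each output pixel corresponds to one look direction: the signals are
    time-aligned (delayed) toward that direction, summed across microphones,
    and the beam energy is stored. All shapes and the FOV are illustrative
    assumptions, not the paper's actual sensor configuration.

    signals:       (n_mics, n_samples) time-domain recordings
    mic_positions: (n_mics, 3) microphone coordinates in meters
    fs:            sampling rate in Hz
    """
    n_mics, _ = signals.shape
    h, w = image_shape
    image = np.zeros((h, w))

    # Grid of look directions spanning the assumed field of view.
    els = np.deg2rad(np.linspace(-fov_deg[0] / 2, fov_deg[0] / 2, h))
    azs = np.deg2rad(np.linspace(-fov_deg[1] / 2, fov_deg[1] / 2, w))

    for i, el in enumerate(els):
        for j, az in enumerate(azs):
            # Unit vector pointing toward this pixel's direction.
            d = np.array([np.cos(el) * np.sin(az),
                          np.sin(el),
                          np.cos(el) * np.cos(az)])
            # Far-field delay of each mic relative to the array origin.
            delays = mic_positions @ d / SPEED_OF_SOUND
            shifts = np.round(delays * fs).astype(int)
            # Align the channels (np.roll wraps around, acceptable for a
            # sketch where delays are much shorter than the signal) and sum.
            beam = np.mean([np.roll(signals[m], -shifts[m])
                            for m in range(n_mics)], axis=0)
            image[i, j] = np.mean(beam ** 2)
    return image
```

A per-frequency variant (computing the beam energy in STFT sub-bands rather than broadband) would yield the per-direction spectral signature the abstract describes; the broadband version is kept short for clarity.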
Saved in:
Published in: | IEEE Transactions on Image Processing, 2022, Vol. 31, p. 1-1 |
Main Authors: | Sanguineti, Valentina; Morerio, Pietro; Del Bue, Alessio; Murino, Vittorio |
Format: | Article |
Language: | eng |
Keywords: | Acoustics; Arrays; Audio data; Audio signals; Datasets; Generative adversarial networks; Image classification; Image processing; Image quality; Image reconstruction; Location awareness; Microphones; Quality assessment; Scene analysis; Sound localization; Spectral signatures; Task analysis; Training; Visualization |
Online Access: | Order full text |
container_end_page | 1 |
container_issue | |
container_start_page | 1 |
container_title | IEEE transactions on image processing |
container_volume | 31 |
creator | Sanguineti, Valentina; Morerio, Pietro; Del Bue, Alessio; Murino, Vittorio |
description | Acoustic images are an emergent data modality for multimodal scene understanding. Such images have the peculiarity of distinguishing the spectral signature of sound coming from different directions in space, thus providing richer information compared to that derived from single or binaural microphones. However, acoustic images are typically generated by cumbersome and costly microphone arrays, which are not as widespread as ordinary microphones. This paper shows that it is still possible to generate acoustic images from off-the-shelf cameras equipped with only a single microphone, and how they can be exploited for audio-visual scene understanding. We propose three architectures inspired by the Variational Autoencoder, U-Net, and adversarial models, and we assess their advantages and drawbacks. Such models are trained to generate spatialized audio by conditioning them on the associated video sequence and its corresponding monaural audio track. Our models are trained using the data collected by a microphone array as ground truth; thus, they learn to mimic the output of an array of microphones in the very same conditions. We assess the quality of the generated acoustic images considering standard generation metrics and different downstream tasks (classification, cross-modal retrieval, and sound localization). We also evaluate our proposed models on multimodal datasets containing acoustic images, as well as datasets containing just monaural audio signals and RGB video frames. In all of the addressed downstream tasks, we obtain notable performance using the generated acoustic data, compared to the state of the art and to the results obtained using real acoustic images as input. (A schematic sketch of such a conditional generator appears after this record.) |
doi_str_mv | 10.1109/TIP.2022.3219228 |
format | Article |
fulltext | fulltext_linktorsrc |
identifier | ISSN: 1057-7149 |
ispartof | IEEE transactions on image processing, 2022, Vol.31, p.1-1 |
issn | 1057-7149; 1941-0042 |
language | eng |
recordid | cdi_ieee_primary_9942928 |
source | IEEE Electronic Library (IEL) |
subjects | Acoustics; Arrays; Audio data; Audio signals; Datasets; Generative adversarial networks; Image classification; Image processing; Image quality; Image reconstruction; Location awareness; Microphones; Quality assessment; Scene analysis; Sound localization; Spectral signatures; Task analysis; Training; Visualization |
title | Unsupervised Synthetic Acoustic Image Generation for Audio-Visual Scene Understanding |
url | https://sfx.bib-bvb.de/sfx_tum?ctx_ver=Z39.88-2004&ctx_enc=info:ofi/enc:UTF-8&ctx_tim=2025-01-30T11%3A09%3A34IST&url_ver=Z39.88-2004&url_ctx_fmt=infofi/fmt:kev:mtx:ctx&rfr_id=info:sid/primo.exlibrisgroup.com:primo3-Article-proquest_RIE&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.genre=article&rft.atitle=Unsupervised%20Synthetic%20Acoustic%20Image%20Generation%20for%20Audio-Visual%20Scene%20Understanding&rft.jtitle=IEEE%20transactions%20on%20image%20processing&rft.au=Sanguineti,%20Valentina&rft.date=2022&rft.volume=31&rft.spage=1&rft.epage=1&rft.pages=1-1&rft.issn=1057-7149&rft.eissn=1941-0042&rft.coden=IIPRE4&rft_id=info:doi/10.1109/TIP.2022.3219228&rft_dat=%3Cproquest_RIE%3E2734167096%3C/proquest_RIE%3E%3Curl%3E%3C/url%3E&disable_directlink=true&sfx.directlink=off&sfx.report_link=0&rft_id=info:oai/&rft_pqid=2736886414&rft_id=info:pmid/36346862&rft_ieee_id=9942928&rfr_iscdi=true |
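The description above explains that the proposed models generate acoustic images by conditioning on a video sequence and its monaural audio track, trained against microphone-array recordings as ground truth. The following PyTorch sketch shows the general shape of such a conditional generator. It is a minimal sketch under assumed tensor shapes (RGB frame 3x224x224, mono log-spectrogram 1x128x128, acoustic image 12x36x48) with hypothetical module names; it does not reproduce the paper's exact VAE, U-Net, or adversarial architectures.

```python
import torch
import torch.nn as nn

class AcousticImageGenerator(nn.Module):
    """Illustrative encoder-decoder mapping (video frame, mono spectrogram)
    to an acoustic image of shape (freq_bins, height, width). All layer
    sizes are assumptions for this sketch."""

    def __init__(self, out_shape=(12, 36, 48), latent_dim=256):
        super().__init__()
        c, h, w = out_shape
        # Video branch: small conv encoder (a pretrained visual backbone
        # could be substituted here).
        self.video_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Audio branch: encodes the monaural spectrogram.
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Decoder: fuses the two embeddings and upsamples to the
        # acoustic-image volume.
        self.decoder = nn.Sequential(
            nn.Linear(2 * latent_dim, 64 * (h // 4) * (w // 4)), nn.ReLU(),
            nn.Unflatten(1, (64, h // 4, w // 4)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, c, 4, stride=2, padding=1),
        )

    def forward(self, frame, mono_spec):
        # Concatenate the modality embeddings and decode, mirroring the
        # "condition on video plus monaural audio" idea in the abstract.
        z = torch.cat([self.video_enc(frame), self.audio_enc(mono_spec)], dim=1)
        return self.decoder(z)

# Training (per the abstract) regresses against real acoustic images
# captured by the array, e.g. with a reconstruction loss such as
#   loss = torch.nn.functional.mse_loss(model(frame, spec), real_image)
# The VAE variant would add a latent KL term, and the adversarial variant
# a discriminator on real vs. generated acoustic images.
```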