Del Visual al Auditivo: Sonorización de Escenas Guiada por Imagen

Recent advances in generative techniques for images, video, text, and audio, together with their adoption by the general public, are leading to new forms of content creation. Until now, each modality has usually been approached separately, which imposes limitations. Automatically adding sound to visual sequences is one of the major challenges in the automatic generation of multimodal content. We present a processing pipeline that, starting from images extracted from videos, is able to sonorize them. We work with pre-trained models that employ complex encoders, contrastive learning, and multiple modalities, enabling rich representations of the sequences for their sonorization. The proposed scheme supports different options for audio mapping and text guidance. We evaluated it on a dataset of frames extracted from a commercial video game and sounds taken from the Freesound platform. Subjective tests show that the scheme is able to generate and assign audio to images automatically and appropriately. Moreover, it adapts well to user preferences, and the proposed objective metrics correlate strongly with the subjective ratings.
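
The abstract leaves the implementation unspecified, but the mapping it describes, matching a video frame (optionally steered by a text prompt) against candidate sounds through a shared contrastive embedding space, can be illustrated with a short similarity-ranking sketch. Everything below is an assumption-laden stand-in, not the authors' code: the embed_* functions are dummy placeholders for whatever pre-trained image, text, and audio encoders the pipeline actually uses, and the guidance weight alpha is a hypothetical parameter.

import numpy as np

rng = np.random.default_rng(0)

def embed_image(image) -> np.ndarray:
    # Stand-in for a pre-trained contrastive image encoder;
    # here it just returns a random unit vector in the shared embedding space.
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def embed_text(prompt: str) -> np.ndarray:
    # Stand-in for the matching text encoder used for guidance.
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def embed_audio(clip) -> np.ndarray:
    # Stand-in for a contrastive audio encoder applied to candidate sounds,
    # e.g. clips retrieved from the Freesound platform.
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def rank_sounds(image, candidates, prompt=None, alpha=0.5):
    # Rank candidate sounds for one frame. The query is the image embedding,
    # optionally blended with a text embedding (alpha weighs the guidance).
    # Candidates are scored by cosine similarity in the shared space.
    query = embed_image(image)
    if prompt is not None:
        query = (1 - alpha) * query + alpha * embed_text(prompt)
        query /= np.linalg.norm(query)
    audio_embs = np.stack([embed_audio(c) for c in candidates])
    scores = audio_embs @ query          # cosine similarity (unit vectors)
    order = np.argsort(scores)[::-1]     # best match first
    return [(candidates[i], float(scores[i])) for i in order]

# Toy usage: three placeholder sound files ranked for one placeholder frame.
ranking = rank_sounds("frame_0001.png", ["rain.wav", "sword.wav", "steps.wav"],
                      prompt="metallic clashing in a battle scene")
for name, score in ranking:
    print(f"{name}: {score:+.3f}")

In a real setup the stubs would be replaced by the actual pre-trained encoders and the candidates by embeddings of real audio clips; the ranking logic itself would stay the same.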

Bibliographic details
Main authors: Sánchez, María; Fernández, Laura; Arias, Julián; Cámara, Mateo; Comini, Giulia; Gabrys, Adam; Blanco, José Luis; Godino, Juan Ignacio; Hernández, Luis Alfonso
Format: Article
Language: English
Subjects: Computer Science - Sound
Online access: Order full text
DOI: 10.48550/arxiv.2402.01385
Published: 2024-02-02
Source: arXiv.org