Del Visual al Auditivo: Sonorización de Escenas Guiada por Imagen (From the Visual to the Auditory: Image-Guided Scene Sonification)
Main Author(s): , , , , , , , ,
Format: Article
Language: eng
Subjects:
Online Access: Order full text
Abstract: Recent advances in image, video, text, and audio generative techniques, and their use by the general public, are leading to new forms of content generation. Traditionally, each modality has been approached separately, which imposes limitations. Automatically adding sound to visual sequences is one of the greatest challenges in automatic multimodal content generation. We present a processing pipeline that, starting from images extracted from videos, is able to sonorize them. We build on pre-trained models that employ complex encoders, contrastive learning, and multiple modalities, enabling rich representations of the sequences for their sonorization. The proposed scheme offers different options for audio mapping and text guidance. We evaluated the scheme on a dataset of frames extracted from a commercial video game and sounds taken from the Freesound platform. Subjective tests show that the proposed scheme can automatically assign suitable audio to images. Moreover, it adapts well to user preferences, and the proposed objective metrics correlate strongly with the subjective ratings.
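The abstract describes mapping images to sounds via representations from contrastively pre-trained multimodal encoders. As a minimal sketch of that idea, and only under the assumption that both modalities are embedded into a shared space (the paper's actual models and mapping are not specified here), each frame can be assigned the candidate sound whose embedding is most similar by cosine similarity. The embeddings below are toy vectors standing in for real encoder outputs.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def assign_sounds(image_embs, audio_embs):
    """For each image embedding, return the index of the most
    similar audio embedding (nearest neighbor in the shared space)."""
    return [max(range(len(audio_embs)), key=lambda j: cosine(img, audio_embs[j]))
            for img in image_embs]

# Toy 4-d embeddings standing in for encoder outputs.
images = [[1.0, 0.0, 0.2, 0.0],
          [0.0, 1.0, 0.0, 0.1],
          [0.1, 0.0, 1.0, 0.0]]
audios = [[0.0, 0.9, 0.0, 0.2],   # closest to image 1
          [0.2, 0.0, 1.0, 0.0],   # closest to image 2
          [1.0, 0.1, 0.1, 0.0]]   # closest to image 0
print(assign_sounds(images, audios))  # [2, 0, 1]
```

This retrieval-style assignment is one of several plausible "audio mapping" strategies the abstract alludes to; text guidance could be added analogously by mixing in the similarity to a text-prompt embedding.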
DOI: 10.48550/arxiv.2402.01385