Mel-Refine: A Plug-and-Play Approach to Refine Mel-Spectrogram in Audio Generation
Main Authors: | , , , , , , , , , , , |
---|---|
Format: | Article |
Language: | eng |
Subjects: | |
Online Access: | Order full text |
Summary: | Text-to-audio (TTA) models can generate diverse audio from textual
prompts. However, most mainstream TTA models, which predominantly rely on
Mel-spectrograms, still face challenges in producing audio with rich content.
The intricate details and texture required in Mel-spectrograms for such audio
often surpass the models' capacity, leading to outputs that are blurred or lack
coherence. In this paper, we begin by investigating the critical role of U-Net
in Mel-spectrogram generation. Our analysis shows that in the U-Net structure,
high-frequency components in skip-connections and the backbone influence
texture and detail, while low-frequency components in the backbone are critical
for the diffusion denoising process. We further propose "Mel-Refine", a
plug-and-play approach that enhances Mel-spectrogram texture and detail by
adjusting different component weights during inference. Our method requires no
additional training or fine-tuning and is fully compatible with any
diffusion-based TTA architecture. Experimental results show that our approach
boosts performance metrics of the latest TTA model Tango2 by 25%,
demonstrating its effectiveness. |
---|---|
DOI: | 10.48550/arxiv.2412.08577 |
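
The summary describes re-weighting low- and high-frequency components of U-Net features at inference time. The sketch below illustrates that general idea on a single 2-D feature map and is not the authors' implementation. It assumes an FFT-based split with a square low-pass mask, and the function names, the `cutoff` parameter, and the scale values are all hypothetical choices made for illustration.

```python
import numpy as np


def split_frequency_bands(feature, cutoff):
    """Split a 2-D feature map into low- and high-frequency parts.

    Assumption: "low frequency" means the square region of half-width
    `cutoff` around the zero frequency of the centred 2-D spectrum.
    """
    rows, cols = feature.shape
    spectrum = np.fft.fftshift(np.fft.fft2(feature))  # zero frequency at centre
    cy, cx = rows // 2, cols // 2
    mask = np.zeros((rows, cols), dtype=bool)
    mask[max(cy - cutoff, 0):cy + cutoff + 1,
         max(cx - cutoff, 0):cx + cutoff + 1] = True
    # Keep only the masked (low) frequencies and transform back.
    low = np.fft.ifft2(np.fft.ifftshift(np.where(mask, spectrum, 0))).real
    high = feature - low  # the remainder carries the high frequencies
    return low, high


def reweight_bands(feature, cutoff, low_scale, high_scale):
    """Recombine the two bands with separate (hypothetical) weights."""
    low, high = split_frequency_bands(feature, cutoff)
    return low_scale * low + high_scale * high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    skip_feature = rng.standard_normal((16, 16))  # stand-in for one skip-connection channel
    refined = reweight_bands(skip_feature, cutoff=3, low_scale=1.0, high_scale=1.2)
    print(refined.shape)
```

With both scales set to 1.0 the feature map is returned unchanged, so any effect comes only from the chosen weights. In a real diffusion model such a function would have to be applied to the skip-connection and backbone tensors inside the U-Net decoder, and which weights work well is something the paper itself would need to specify.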